A month ago, I started participating in my first Kaggle competition. I had wanted to try a Kaggle competition for a while, and when Facebook launched a recruiting competition, *Human or Robot?*, I decided to join the party.

I finished 32<sup>nd</sup> on the final ranking (private leaderboard), and I could actually have been ranked 17<sup>th</sup> if I had chosen another of my predictions as my final submission... (That additional URL feature wasn't that useless after all...)

Well... Let's get started!

# Import needed libraries

Nothing fancy here, we just use some classic Python data science libraries: numpy, scikit-learn and pandas (with sklearn-pandas for its `DataFrameMapper`), plus pickle to save the result.

In [1]:
import pickle
import numpy as np
import sklearn.preprocessing
from sklearn_pandas import DataFrameMapper 
import pandas as pd

# Read the data

The first step is to read the CSV files and load them as Pandas DataFrames. I used Pandas here because the "heavy" work needed to create the features is very easy to do with it, and would have been much more painful with numpy only, for example.

In [2]:
# Read the bids and replace NaN values with a dash, because the NaNs here are unknown string categories
bids = pd.read_csv('bids.csv', header=0)
bids.fillna('-', inplace=True)

# Load train and test bidders lists (nothing really interesting in that list)
train = pd.read_csv('train.csv', header=0, index_col=0)
test = pd.read_csv('test.csv', header=0, index_col=0)

# Join those 2 datasets together (-1 outcome meaning unknown i.e. test)
test['outcome'] = -1.0
bidders = pd.concat((train, test))
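Tagging the test bidders with a `-1` outcome means the labelled and unlabelled bidders can be separated again later with a simple filter. Here is a minimal sketch of that round trip, using made-up bidder ids rather than the real CSV files:

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (ids and outcomes are made up)
train = pd.DataFrame({"outcome": [0.0, 1.0]}, index=["bidder_a", "bidder_b"])
test = pd.DataFrame(index=["bidder_c"])

# Same trick as above: tag test bidders with -1 before concatenating
test["outcome"] = -1.0
bidders = pd.concat((train, test))

# The sentinel value lets us recover both halves later
labelled = bidders[bidders["outcome"] >= 0]
unlabelled = bidders[bidders["outcome"] < 0]
print(len(labelled), len(unlabelled))
```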

# Dataset investigation

Prior to the feature creation, some inspection of the data was done, of course, but I didn't keep any trace of this quick'n'dirty work anywhere and won't be able to show it to you.

The idea is to get to know what's in the dataset. The first step is just to look at the raw data, and then to look at some stats about the data. For example, the first thing you might wonder is what the heck you can do with those *payment_account* and *address* hashes that Kaggle gave you for each bidder. Do any of these values appear more than once? Nope, they are as unique as the *bidder_id*, so you can throw them away. At this point you know that all your features must come from the bids themselves.
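The original exploration code was not kept, so the following is only a sketch of how such a uniqueness check might look. The data frame and its hash values are invented, and `repeated_values` is a hypothetical helper, not something from the competition code:

```python
import pandas as pd

# Toy stand-in for train.csv; the hash values are made up for illustration
toy_bidders = pd.DataFrame(
    {
        "payment_account": ["a1", "b2", "c3", "d4"],
        "address": ["w9", "x8", "y7", "z6"],
    },
    index=pd.Index(["bidder_1", "bidder_2", "bidder_3", "bidder_4"], name="bidder_id"),
)

def repeated_values(frame, column):
    """Return the values of `column` that appear more than once."""
    counts = frame[column].value_counts()
    return counts[counts > 1].index.tolist()

for column in ["payment_account", "address"]:
    repeats = repeated_values(toy_bidders, column)
    print(column, "unique:", toy_bidders[column].is_unique, "repeated:", repeats)
```

If every column comes back unique, as it did for the real hashes, the column carries no more information than the bidder id itself.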

Then you do some stats and plots of what's in the bids, and you look for what could be interesting to compute as features...

# Features creation

... and then there is a moment when you decide that it's time to get this party started and start to do something with these bids.

The thing I hesitated about was whether I should predict if a bidder is a bot based on aggregates of info about his bids, or predict for each bid whether it was made by a bot and then aggregate those bid-level predictions to create a prediction at bidder level.

I preferred to work at the bidder level because I had the feeling that a single bid doesn't carry enough info by itself to allow proper prediction, and that aggregating a set of bids would allow me to create higher-level features about the general behavior of a bidder and therefore get more useful info.

I didn't have the time to actually try the bid-level prediction approach. I don't know how it would have turned out.
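For reference, the aggregation step of that untried bid-level approach could look roughly like the sketch below. The per-bid probabilities are invented (no model produced them), and using the mean and max as aggregates is just one assumption among many possible choices:

```python
import pandas as pd

# Hypothetical per-bid bot probabilities, as a bid-level classifier might output
bid_scores = pd.DataFrame(
    {
        "bidder_id": ["b1", "b1", "b1", "b2", "b2"],
        "bot_proba": [0.9, 0.7, 0.8, 0.1, 0.3],
    }
)

# Collapse bid-level scores into one score per bidder
bidder_scores = bid_scores.groupby("bidder_id")["bot_proba"].agg(["mean", "max"])
print(bidder_scores)
```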

So the first thing you do is look at the info you have on each bid and at what features you can compute from it. We have this:

* **auction** (category) – Unique identifier of an auction
* **merchandise** (category) –  The category of the auction site campaign, which means the bidder might come to this site by way of searching for "home goods" but ended up bidding for "sporting goods" - and that leads to this field being "home goods". This categorical field could be a search term, or online advertisement. 
* **device** (category) – Phone model of a visitor
* **time** (real) – Time that the bid is made (transformed to protect privacy).
* **country** (category) – The country that the IP belongs to
* **ip** (category) – IP address of a bidder (obfuscated to protect privacy).
* **url** (category) – url where the bidder was referred from (obfuscated to protect privacy).

We only have one real feature and lots of categories.

My first approach was what I've seen called the "kitchen sink approach": I basically decided to compute whatever statistic crossed my mind (as long as I wasn't too lazy to implement it, so it had better be simple or genius). I decided to apply the same analysis to all categories. The same idea goes for the real variable, with different stats of course.

## Category variable feature extraction

The idea is to group the bids of a bidder and compute stats about the group (which is therefore a series of values for each variable). For a list of strings, you then get stats about these strings and their frequencies:

* number of unique categories that appear
* highest frequency of appearance
* lowest frequency of appearance
* category that appears the most
* standard deviation of the frequencies

and... that's it!

In [3]:
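def computeStatsCat(series, normalizeCount = 1.0):
    
    # Count how many times each category appears (sorted by decreasing count)
    n = float(series.shape[0])
    counts = series.value_counts()
    
    # Use positional access (.iloc): the index of counts holds the category labels
    nbUnique = counts.count() / normalizeCount  # number of distinct categories
    hiFreq = counts.iloc[0] / n                 # frequency of the most common category
    loFreq = counts.iloc[-1] / n                # frequency of the rarest category
    argmax = counts.index[0]                    # the most common category itself
    stdFreq = np.std(counts / n)
    
    return (nbUnique, loFreq, hiFreq, stdFreq, argmax)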
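To make those five statistics concrete, here is a small worked example on a made-up list of countries. It recomputes the same quantities inline rather than calling the function above, so that it runs on its own:

```python
import numpy as np
import pandas as pd

# Hypothetical list of countries seen in one bidder's bids
countries = pd.Series(["us", "us", "us", "fr", "in", "in"])

n = float(len(countries))
counts = countries.value_counts()   # sorted by decreasing count
freqs = counts / n

nb_unique = len(counts)             # 3 distinct countries
hi_freq = freqs.iloc[0]             # 3 bids out of 6
lo_freq = freqs.iloc[-1]            # 1 bid out of 6
most_common = counts.index[0]       # the label with the highest count
std_freq = np.std(freqs.values)

print(nb_unique, lo_freq, hi_freq, std_freq, most_common)
```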


## Real variable feature extraction

For time, I decided to go a little bit deeper and group the series of bid timestamps by auction, which gives two stages of stats: stats at auction level that are then aggregated, and global stats for the whole set of timestamps of the bidder.

You see below a few functions that allowed me to compute those features. Basically, for each auction I compute the min, max and range of the timestamps, and then a bunch of things about the interval between two consecutive bids: mean interval, standard deviation, percentiles. I then aggregate those results over all the auctions a bidder took part in, always with the same kind of simple stats.

In [4]:
# Compute stats of numerical series without caring about interval between values
def computeStatsNumNoIntervals(series):
    
    min = series.min()
    max = series.max()
    mean = np.mean(series)
    std = np.std(series)
    perc20 = np.percentile(series, 20)
    perc50 = np.percentile(series, 50)
    perc80 = np.percentile(series, 80)
    
    return (min, max, mean, std, perc20, perc50, perc80)

# Compute stats of a numerical series, taking intervals between values into account
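def computeStatsNum(series, copy = True):
    
    # np.sort returns a sorted copy, so the caller's series is never modified
    # (the `copy` argument is kept only so that existing calls still work)
    sortedValues = np.sort(series.values)

    # Gaps between consecutive values; a single value has no gap, so use 0
    intervals = np.diff(sortedValues)
    if len(intervals) < 1:
        intervals = np.array([0])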

    nb = series.shape[0]
    min = series.min()
    max = series.max()
    range = max - min
    intervalsMin = np.min(intervals)
    intervalsMax = np.max(intervals)
    intervalsMean = np.mean(intervals)
    intervalsStd = np.std(intervals)
    intervals25 = np.percentile(intervals, 25)
    intervals50 = np.percentile(intervals, 50)
    intervals75 = np.percentile(intervals, 75)
    
    return (nb, min, max, range,
            intervalsMin, intervalsMax, intervalsMean, intervalsStd,
            intervals25, intervals50, intervals75)

# Compute stats about a numerical column of table, with stats on sub-groups of this column (auctions in our case).
def computeStatsNumWithGroupBy(table, column, groupby):
    
    # get series and groups
    series = table[column]
    groups = table.groupby(groupby)
    
    # global stats
    (nb, min, max, range,
    intervalsMin, intervalsMax, intervalsMean, intervalsStd,
    intervals25, intervals50, intervals75) = computeStatsNum(series)
    
    # stats by group
    X = []
    for _, group in groups:
        (grpNb, _, _, grpRange, grpIntervalsMin, grpIntervalsMax, grpIntervalsMean, grpIntervalsStd, _, _, _) = computeStatsNum(group[column])
        X.append([grpNb, grpRange, grpIntervalsMin, grpIntervalsMax, grpIntervalsMean, grpIntervalsStd])
    X = np.array(X)
    
    grpNbMean = np.mean(X[:,0])
    grpNbStd = np.std(X[:,0])
    grpRangeMean = np.mean(X[:,1])
    grpRangeStd = np.std(X[:,1])
    grpIntervalsMinMin = np.min(X[:,2])
    grpIntervalsMinMean = np.mean(X[:,2])
    grpIntervalsMaxMax = np.max(X[:,3])
    grpIntervalsMaxMean = np.mean(X[:,3])
    grpIntervalsMean = np.mean(X[:,4])
    grpIntervalsMeanStd = np.std(X[:,4])
    grpIntervalsStd = np.mean(X[:,5])
    
    return (nb, min, max, range,
            intervalsMin, intervalsMax, intervalsMean, intervalsStd,
            intervals25, intervals50, intervals75,
            grpNbMean, grpNbStd, grpRangeMean, grpRangeStd,
            grpIntervalsMinMin, grpIntervalsMinMean, grpIntervalsMaxMax, grpIntervalsMaxMean,
            grpIntervalsMean, grpIntervalsMeanStd, grpIntervalsStd)
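The two stages are easier to see on a tiny example. The sketch below uses made-up timestamps for a single bidder, and a hypothetical `intervals` helper that mirrors the "sort, then take consecutive differences" logic of the functions above:

```python
import numpy as np
import pandas as pd

# Toy bids for one bidder; timestamps and auction ids are made up
toy_bids = pd.DataFrame(
    {
        "auction": ["a", "a", "a", "b", "b"],
        "time": [10, 40, 20, 100, 130],
    }
)

def intervals(times):
    """Gaps between consecutive timestamps, after sorting."""
    gaps = np.diff(np.sort(np.asarray(times)))
    return gaps if len(gaps) > 0 else np.array([0])

# Stage 1: intervals over all of the bidder's bids, ignoring auctions
global_gaps = intervals(toy_bids["time"])

# Stage 2: mean interval inside each auction, then aggregated over auctions
per_auction_mean = toy_bids.groupby("auction")["time"].apply(lambda t: intervals(t).mean())

print(global_gaps)              # gaps over the sorted timestamps 10, 20, 40, 100, 130
print(per_auction_mean)         # one mean gap per auction
print(per_auction_mean.mean())  # analogue of grpIntervalsMean
```

The global gaps mix bids from different auctions, while the per-auction gaps only compare bids placed in the same auction, which is why both views are kept as features.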

## Features I tried that did not really work

### From categories to real values

In a desperate attempt to increase my score, I thought about replacing categories in the *bids* dataset with real values, by computing general stats about each category of each variable and replacing the category with those stats; in my case, the probability that the category appears in a bot's bid.

I am aware that this is getting close to the danger of [Data Leakage](https://www.kaggle.com/wiki/Leakage) because you are explicitly introducing information about the target into the features. However, I feel that because the real value used to represent the category is computed on the whole dataset, it might in some cases be OK since it is very aggregated info, provided you have a lot of data in each category.

In this case, I think that it was definitely data leakage, because I got a 0.97 AUC on my CV but a 0.86 score on the public leaderboard (a big drop from my results without those features). But it was worth trying!

In [5]:
def computeOutcomeProbaByCat(data, cats):
    # For each categorical column, compute the mean outcome (share of bot bids) per category
    stats = {}
    for cat in cats:
        stats[cat] = pd.DataFrame({cat + 'Num': data.groupby(cat)['outcome'].mean()})
    return stats

#bidsWithOutcome = pd.merge(bids, bidders[['outcome']], how='left', left_on='bidder_id', right_index=True)
#stats = computeOutcomeProbaByCat(bidsWithOutcome[bidsWithOutcome.outcome >= 0], [u'auction', u'merchandise', u'device', u'country', u'url'])

# Add real columns to bids dataframe
#for cat in stats:
#    bids = pd.merge(bids, stats[cat], how='left', left_on=cat, right_index=True)
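One common way to reduce this kind of leakage, which was not tried in this competition, is out-of-fold encoding: the value attached to a bid is computed only from bidders in the other folds, so a bidder's own label never feeds its own features. The sketch below is an illustration on made-up data; `out_of_fold_rate`, the fold count and the fallback to the fold's overall bot rate are all assumptions of the sketch:

```python
import numpy as np
import pandas as pd

# Toy labelled bids; all values are made up
toy = pd.DataFrame(
    {
        "bidder_id": ["b1", "b1", "b2", "b2", "b3", "b3", "b4", "b4"],
        "country":   ["us", "fr", "us", "us", "fr", "in", "in", "us"],
        "outcome":   [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
    }
)

def out_of_fold_rate(data, cat, n_folds=2, seed=0):
    """Encode `cat` by the bot rate computed on bidders from the other folds only."""
    # Assign whole bidders (not single bids) to folds
    bidders = data["bidder_id"].unique()
    rng = np.random.RandomState(seed)
    fold_of_bidder = dict(zip(bidders, rng.permutation(len(bidders)) % n_folds))
    folds = data["bidder_id"].map(fold_of_bidder)

    encoded = pd.Series(np.nan, index=data.index)
    for fold in range(n_folds):
        held_out = folds == fold
        others = data[~held_out]
        rates = others.groupby(cat)["outcome"].mean()
        # Categories unseen in the other folds fall back to their overall bot rate
        encoded[held_out] = data.loc[held_out, cat].map(rates).fillna(others["outcome"].mean())
    return encoded

toy["countryNum"] = out_of_fold_rate(toy, "country")
print(toy)
```

Whether this would have closed the gap between CV and leaderboard scores here is unknown; it only removes the most direct path from a bidder's label to its own features.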

### Make a special case for merchandise category

I also wanted to make a special case for the merchandise category, with a couple of features per category indicating how the bidder participated in the auctions for this merchandise: the number of bids in the category and the percentage of his bids made in this category.

This did not change my score in any way, probably because very few bidders actually participate in multiple merchandise categories, if I remember well.
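As an illustration of those two per-category features, here is a compact sketch using `pd.crosstab` on made-up bids. It is not the code used at the time (that version survives only as comments in the feature loop below), and the column naming simply imitates the `merch...Abs` / `merch...Prop` labels found there:

```python
import pandas as pd

# Toy bids; bidder ids and merchandise labels are made up
toy_bids = pd.DataFrame(
    {
        "bidder_id":   ["b1", "b1", "b1", "b2", "b2"],
        "merchandise": ["mobile", "mobile", "jewelry", "jewelry", "jewelry"],
    }
)

# Number of bids per (bidder, merchandise) pair
merch_counts = pd.crosstab(toy_bids["bidder_id"], toy_bids["merchandise"])

# Share of each bidder's bids that fall in each merchandise category
merch_props = merch_counts.div(merch_counts.sum(axis=1), axis=0)

merch_features = merch_counts.add_prefix("merch").add_suffix("Abs").join(
    merch_props.add_prefix("merch").add_suffix("Prop")
)
print(merch_features)
```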

## Features I didn't try

There are a lot of features I could have tried if I had had the time and motivation. You will find a lot of different ideas in other participants' feedback on this contest. There are a lot of great features I do not have; however, it seems that what I have here already gives pretty good results.

## About features interpretation

Lots of people like to look at the contribution of each feature in the final classifiers. I'm sure it might give some information and ideas about how to improve your features, but I didn't do it in this competition. In any case, I'm not a big fan of interpreting a Machine Learning model, a practice also "criticized" in the [great kdnuggets blog article "The Myth of Model Interpretability"](http://www.kdnuggets.com/2015/04/model-interpretability-neural-networks-deep-learning.html).

## Global computation of features

Well, finally we need to compute all those features from our dataset. I know this block of code is kind of dirty, but since I wanted to be able to include or exclude features at ease, this was my solution. This is the moment when the multi-cursor feature of Sublime Text starts being very useful.

In [6]:
# Init vars
Xids = []
X = []

# Old init for stats about merchandises
# merchandises = bids.merchandise.value_counts()

# For each bidder
for bidder, group in bids.groupby('bidder_id'):
    
    # Compute the stats
    (nbUniqueIP, loFreqIP, hiFreqIP, stdFreqIP, IP) = computeStatsCat(group.ip)
    (nbUniqueDevice, loFreqDevice, hiFreqDevice, stdFreqDevice, device) = computeStatsCat(group.device)
    (nbUniqueMerch, loFreqMerch, hiFreqMerch, stdFreqMerch, merch) = computeStatsCat(group.merchandise)
    (nbUniqueCountry, loFreqCountry, hiFreqCountry, stdFreqCountry, country) = computeStatsCat(group.country)
    (nbUniqueUrl, loFreqUrl, hiFreqUrl, stdFreqUrl, url) = computeStatsCat(group.url)
    (nbUniqueAuction, loFreqAuction, hiFreqAuction, stdFreqAuction, auction) = computeStatsCat(group.auction)
    (auctionNb, auctionMin, auctionMax, auctionRange,
    auctionIntervalsMin, auctionIntervalsMax, auctionIntervalsMean, auctionIntervalsStd,
    auctionIntervals25, auctionIntervals50, auctionIntervals75,
    auctionGrpNbMean, auctionGrpNbStd, auctionGrpRangeMean, auctionGrpRangeStd,
    auctionGrpIntervalsMinMin, auctionGrpIntervalsMinMean, auctionGrpIntervalsMaxMax, auctionGrpIntervalsMaxMean,
    auctionGrpIntervalsMean, auctionGrpIntervalsMeanStd, auctionGrpIntervalsStd) = computeStatsNumWithGroupBy(group, 'time', 'auction')
    
    # Save the stats
    # Also I don't really remember which category features I kept or not in my final submission :$
    # I think it was IP + device + merch + country, but for computation time let's comment some of these
    x = [nbUniqueIP, loFreqIP, hiFreqIP, stdFreqIP, #IP,
          nbUniqueDevice, loFreqDevice, hiFreqDevice, stdFreqDevice, #device,
          nbUniqueMerch, loFreqMerch, hiFreqMerch, stdFreqMerch, merch,
          nbUniqueCountry, loFreqCountry, hiFreqCountry, stdFreqCountry, country,
          nbUniqueUrl, loFreqUrl, hiFreqUrl, stdFreqUrl, #url,
          nbUniqueAuction, loFreqAuction, hiFreqAuction, stdFreqAuction, #auction
          auctionNb, auctionMin, auctionMax, auctionRange,
          auctionIntervalsMin, auctionIntervalsMax, auctionIntervalsMean, auctionIntervalsStd,
          auctionIntervals25, auctionIntervals50, auctionIntervals75,
          auctionGrpNbMean, auctionGrpNbStd, auctionGrpRangeMean, auctionGrpRangeStd,
          auctionGrpIntervalsMinMin, auctionGrpIntervalsMinMean, auctionGrpIntervalsMaxMax, auctionGrpIntervalsMaxMean,
          auctionGrpIntervalsMean, auctionGrpIntervalsMeanStd, auctionGrpIntervalsStd]
    
    ## Old stats per merchandise
    # for key in merchandisesCounts.index:
    #     merchandisesTmp[key] = merchandisesCounts[key]
    #     merchandisesTmp2[key] = float(merchandisesCounts[key]) / len(group)
    # merchandisesTmp = merchandises * 0
    # merchandisesTmp2 = (merchandises * 0).astype('float')
    # merchandisesCounts = group.merchandise.value_counts()
    # x += merchandisesTmp.tolist();
    # x += merchandisesTmp2.tolist();
    
    # Old stats replacing using real value substitution of categories
    # catCols = []
    # for cat in stats:
    #     (catMin, catMax, catMean, catStd, catPerc20, catPerc50, catPerc80) = computeStatsNumNoIntervals(group[cat+'Num'])
    #     x += [catMin, catMax, catMean, catStd, catPerc20, catPerc50, catPerc80]

    # Save the stats in the result arrays
    Xids.append(bidder)
    X.append(x)

# Features labels
Xcols = ['nbUniqueIP', 'loFreqIP', 'hiFreqIP', 'stdFreqIP', #'IP',
              'nbUniqueDevice', 'loFreqDevice', 'hiFreqDevice', 'stdFreqDevice', #'device',
              'nbUniqueMerch', 'loFreqMerch', 'hiFreqMerch', 'stdFreqMerch', 'merch',
              'nbUniqueCountry', 'loFreqCountry', 'hiFreqCountry', 'stdFreqCountry', 'country',
              'nbUniqueUrl', 'loFreqUrl', 'hiFreqUrl', 'stdFreqUrl', #'url',
              'nbUniqueAuction', 'loFreqAuction', 'hiFreqAuction', 'stdFreqAuction', #'auction',
              'auctionNb', 'auctionMin', 'auctionMax', 'auctionRange',
              'auctionIntervalsMin', 'auctionIntervalsMax', 'auctionIntervalsMean', 'auctionIntervalsStd',
              'auctionIntervals25', 'auctionIntervals50', 'auctionIntervals75',
              'auctionGrpNbMean', 'auctionGrpNbStd', 'auctionGrpRangeMean', 'auctionGrpRangeStd',
              'auctionGrpIntervalsMinMin', 'auctionGrpIntervalsMinMean', 'auctionGrpIntervalsMaxMax', 'auctionGrpIntervalsMaxMean',
              'auctionGrpIntervalsMean', 'auctionGrpIntervalsMeanStd', 'auctionGrpIntervalsStd']

# Old features labels when replacing using real value substitution of categories
# for cat in stats:
#     Xcols += [cat + 'NumMin', cat + 'NumMax', cat + 'NumMean', cat + 'NumStd', cat + 'NumPerc20', cat + 'NumPerc50', cat + 'NumPerc80']
# Xcols += map(lambda x: "merch" + x + "Abs", merchandisesTmp.keys().tolist())
# Xcols += map(lambda x: "merch" + x + "Prop", merchandisesTmp.keys().tolist())

# Create a pandas dataset, remove NaN from dataset and show it
dataset = pd.DataFrame(X,index=Xids, columns=Xcols)
dataset.fillna(0.0, inplace=True)
dataset

bidder_id,nbUniqueIP,loFreqIP,hiFreqIP,stdFreqIP,nbUniqueDevice,loFreqDevice,hiFreqDevice,stdFreqDevice,nbUniqueMerch,loFreqMerch,...,auctionGrpNbStd,auctionGrpRangeMean,auctionGrpRangeStd,auctionGrpIntervalsMinMin,auctionGrpIntervalsMinMean,auctionGrpIntervalsMaxMax,auctionGrpIntervalsMaxMean,auctionGrpIntervalsMean,auctionGrpIntervalsMeanStd,auctionGrpIntervalsStd
001068c415025a009fee375a12cff4fcnht8y,1,1.000000,1.000000,0.000000e+00,1,1.000000,1.000000,0.000000e+00,1,1,...,0.000000,0.000000e+00,0.000000e+00,0,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00
002d229ffb247009810828f648afc2ef593rb,1,1.000000,1.000000,0.000000e+00,2,0.500000,0.500000,0.000000e+00,1,1,...,0.000000,1.052632e+08,0.000000e+00,105263158,1.052632e+08,1.052632e+08,1.052632e+08,1.052632e+08,0.000000e+00,0.000000e+00
0030a2dd87ad2733e0873062e4f83954mkj86,1,1.000000,1.000000,0.000000e+00,1,1.000000,1.000000,0.000000e+00,1,1,...,0.000000,0.000000e+00,0.000000e+00,0,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00
003180b29c6a5f8f1d84a6b7b6f7be57tjj1o,3,0.333333,0.333333,0.000000e+00,3,0.333333,0.333333,0.000000e+00,1,1,...,0.000000,0.000000e+00,0.000000e+00,0,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00
00486a11dff552c4bd7696265724ff81yeo9v,10,0.050000,0.300000,7.071068e-02,8,0.050000,0.350000,1.060660e-01,1,1,...,0.634324,1.518721e+12,2.301461e+12,0,1.090126e+12,5.571737e+12,1.401996e+12,1.246061e+12,1.775155e+12,1.559352e+11
0051aef3fdeacdadba664b9b3b07e04e4coc6,10,0.014706,0.352941,1.196159e-01,6,0.014706,0.573529,2.122045e-01,1,1,...,15.806328,3.658421e+12,3.867043e+12,0,2.938495e+12,1.070468e+13,3.235968e+12,2.957614e+12,4.165561e+12,5.661764e+10
0053b78cde37c4384a20d2da9aa4272aym4pb,1951,0.000091,0.203675,5.402422e-03,518,0.000091,0.074778,5.825471e-03,1,1,...,69.590688,1.783254e+13,2.853382e+13,0,1.068965e+12,7.041958e+13,1.387711e+13,2.712753e+12,8.310077e+12,2.983981e+12
0061edfc5b07ff3d70d693883a38d370oy4fs,53,0.007463,0.082090,1.731162e-02,45,0.007463,0.089552,1.929995e-02,1,1,...,3.234167,4.446215e+12,4.464034e+12,0,1.230456e+12,1.125195e+13,2.864519e+12,1.846340e+12,2.574046e+12,6.243586e+11
00862324eb508ca5202b6d4e5f1a80fc3t3lp,1,1.000000,1.000000,0.000000e+00,1,1.000000,1.000000,0.000000e+00,1,1,...,0.000000,3.052632e+09,0.000000e+00,526315790,5.263158e+08,1.210526e+09,1.210526e+09,7.631579e+08,0.000000e+00,2.696566e+08
009479273c288b1dd096dc3087653499lrx3c,1,1.000000,1.000000,0.000000e+00,1,1.000000,1.000000,0.000000e+00,1,1,...,0.000000,0.000000e+00,0.000000e+00,0,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00


## Saving the final dataset

Now that we have a nice dataset, we need to make it a Machine Learnable one, because we still have categories, variables with very different ranges, etc. To do this, I used the `DataFrameMapper` class of the [sklearn-pandas package](https://github.com/paulgb/sklearn-pandas), which allows you to easily transform a DataFrame into a numpy matrix of numbers.

In [7]:
# First let's join the dataset with outcome because it might be a useful info in the future ;)
datasetFull = dataset.join(bidders[['outcome']])
types = datasetFull.dtypes

# Create a mapper that "standard scale" numbers and binarize categories
mapperArg = []
for col, colType in zip(types.index, types):  # iterate over (column name, dtype) pairs
    if col == 'outcome':
        continue
    if colType.name == 'float64' or colType.name =='int64':
        mapperArg.append((col, sklearn.preprocessing.StandardScaler()))
    else:
        mapperArg.append((col, sklearn.preprocessing.LabelBinarizer()))
mapper = DataFrameMapper(mapperArg)

# Apply the mapper to create the dataset
Xids = datasetFull.index.tolist()
X = mapper.fit_transform(datasetFull)
y = datasetFull[['outcome']].values

# Last check!
print(bidders['outcome'].value_counts())

# Save in pickle file (the with block makes sure the file is closed)
with open('Xy.pkl', 'wb') as f:
    pickle.dump([Xids, X, y], f)

-1    4700
 0    1910
 1     103
dtype: int64
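To show what "standard scale numbers and binarize categories" does to a table, here is a small sketch on made-up values. It uses only pandas and numpy instead of `DataFrameMapper`, so it is an approximation of the cell above rather than a copy of it. In particular, `pd.get_dummies` emits one column per category, whereas `LabelBinarizer` emits a single 0/1 column for a two-class variable:

```python
import numpy as np
import pandas as pd

# Toy feature table mixing numeric and categorical columns; values are made up
toy = pd.DataFrame(
    {
        "nbUniqueIP": [1, 10, 1951],
        "hiFreqIP":   [1.0, 0.3, 0.2],
        "country":    ["us", "fr", "us"],
    },
    index=["b1", "b2", "b3"],
)

blocks = []
for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        # Zero mean, unit variance (population std, as StandardScaler uses)
        values = toy[col].astype(float)
        std = values.std(ddof=0)
        blocks.append(((values - values.mean()) / (std if std > 0 else 1.0)).to_frame())
    else:
        # One 0/1 column per category
        blocks.append(pd.get_dummies(toy[col], prefix=col).astype(float))

X_toy = pd.concat(blocks, axis=1).values
print(X_toy.shape)
```

Scaling matters here because features such as `nbUniqueIP` and the interval statistics span many orders of magnitude, as the table above shows.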

