# Python Project : Bot detection 
    
An auction website has a huge problems. A lot of accounts are actually bots that will do the betting and have a strong edge over the users. The website wants to find a way to detect the bots in order to kick them out of the system and then create a bettre experience for the customers.

To answer this problematic, we have a database of more than 7.5 Millions observations of betting, linked to the account with these differents observations : 
    
    bid_id - unique id for this bid
    bidder_id – Unique identifier of a bidder
    auction – Unique identifier of an auction
    merchandise –  The category of the auction site campaign, which means the bidder might come to this site by way of searching for "home goods" but ended up bidding for "sporting goods" - and that leads to this field being "home goods". This categorical field could be a search term, or online advertisement. 
    device – Phone model of a visitor
    time - Time that the bid is made.
    country - The country that the IP belongs to
    ip – IP address of a bidder.
    url - url where the bidder was referred from. 
    
We also have a train dataset with the bidders with : 

    bidder_id – Unique identifier of a bidder.
    payment_account – Payment account associated with a bidder. These are obfuscated to protect privacy. 
    address – Mailing address of a bidder. These are obfuscated to protect privacy. 
    outcome – Label of a bidder indicating whether or not it is a robot. Value 1.0 indicates a robot, where value 0.0 indicates human. 
    
Here is a summary of the framework : 
    - Quick overview of the data and creation of new variables based on how we understand the problem
    - EDA of the data and new variables given
    - Utilization of the EDA to create second order variables
    - Creation of the prediction model
    

### Libraries

In [1]:
import pandas as pd
import numpy as np
import csv
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import math 
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import auc
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold

# 1 - Data frame quick overview and variables creation

If it is unusual to create new variables before even doing an EDA, it will be here extremely useful. In fact, it is difficult to predict and analyze anything from the variables given. Being in a time-serie context and having to predict something that is directed towards the accounts and not the bids (which is the main informative dataframe) put us in a context where we have to modify the data before even going into an EDA
Here lets have an overview of what we have :

In [2]:
bids = pd.read_csv("Downloads/bids.csv",index_col = "bid_id")
train = pd.read_csv("Downloads/train.csv")

  mask |= (ar1 == a)


In [3]:
print(bids.shape)
bids.head(10)

(7656334, 8)


Unnamed: 0_level_0,bidder_id,auction,merchandise,device,time,country,ip,url
bid_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,8dac2b259fd1c6d1120e519fb1ac14fbqvax8,ewmzr,jewelry,phone0,9759243157894736,us,69.166.231.58,vasstdc27m7nks3
1,668d393e858e8126275433046bbd35c6tywop,aeqok,furniture,phone1,9759243157894736,in,50.201.125.84,jmqlhflrzwuay9c
2,aa5f360084278b35d746fa6af3a7a1a5ra3xe,wa00e,home goods,phone2,9759243157894736,py,112.54.208.157,vasstdc27m7nks3
3,3939ac3ef7d472a59a9c5f893dd3e39fh9ofi,jefix,jewelry,phone4,9759243157894736,in,18.99.175.133,vasstdc27m7nks3
4,8393c48eaf4b8fa96886edc7cf27b372dsibi,jefix,jewelry,phone5,9759243157894736,in,145.138.5.37,vasstdc27m7nks3
5,e8291466de91b0eb4e1515143c7f74dexy2yr,3vi4t,mobile,phone7,9759243157894736,ru,91.107.221.27,vasstdc27m7nks3
6,eef4c687daf977f64fc1d08675c44444raj3s,kjlzx,mobile,phone2,9759243210526315,th,152.235.155.159,j9nl1xmo6fqhcc6
7,ab056855c9ca9d36390feae1fa485883issyg,f5f6k,home goods,phone8,9759243210526315,id,3.210.112.183,hnt6hu93a3z1cpc
8,d600dc03b11e7d782e1e4dae091b084a1h5ch,h7jjx,home goods,phone9,9759243210526315,th,103.64.157.225,vasstdc27m7nks3
9,a58ace8b671a7531c88814bc86b2a34cf0crb,3zpkj,sporting goods,phone4,9759243210526315,za,123.28.123.226,vasstdc27m7nks3


In [4]:
print(train.shape)
train.head(10)

(2013, 4)


Unnamed: 0,bidder_id,payment_account,address,outcome
0,91a3c57b13234af24875c56fb7e2b2f4rb56a,a3d2de7675556553a5f08e4c88d2c228754av,a3d2de7675556553a5f08e4c88d2c228vt0u4,0.0
1,624f258b49e77713fc34034560f93fb3hu3jo,a3d2de7675556553a5f08e4c88d2c228v1sga,ae87054e5a97a8f840a3991d12611fdcrfbq3,0.0
2,1c5f4fc669099bfbfac515cd26997bd12ruaj,a3d2de7675556553a5f08e4c88d2c2280cybl,92520288b50f03907041887884ba49c0cl0pd,0.0
3,4bee9aba2abda51bf43d639013d6efe12iycd,51d80e233f7b6a7dfdee484a3c120f3b2ita8,4cb9717c8ad7e88a9a284989dd79b98dbevyi,0.0
4,4ab12bc61c82ddd9c2d65e60555808acqgos1,a3d2de7675556553a5f08e4c88d2c22857ddh,2a96c3ce94b3be921e0296097b88b56a7x1ji,0.0
5,7eaefc97fbf6af12e930528151f86eb91bafh,a3d2de7675556553a5f08e4c88d2c228yory1,5a1d8f28bc31aa6d72bef2d8fbf48b967hra3,0.0
6,25558d24bca82beef0f9db4ba1fe2045ynnvq,81580585d4dedd473da11aabf37fe9d4e2s2n,9a6d81115b9b653ba326eb510e9163b47drqj,0.0
7,88ae7a35e374a6fddd079ebb28c822eeohwse,a3d2de7675556553a5f08e4c88d2c2289zref,3a7e6a32b24aeab0688e91a41f3188e22iuec,0.0
8,57db69e32163f3e486dc6ef7d615aa12usje6,bf1c3151cc309308077ad0ccb99779ad12apw,31b95425d178b89fd7306762bb48bfb5n04sj,0.0
9,d1be739798ba0745a1fd72ac918a9f1929hei,f49162ea9903fc00e4721d2f7972df9d6az4s,5b1f6e97a1cc27cd7fa9a3fe17eccd2a6mpdv,0.0


In [5]:
print(train[train.outcome == 1].shape)
print(train[train.outcome == 0].shape)


(103, 4)
(1910, 4)


In [6]:
bids.shape

(7656334, 8)

Creating the first values that could be interesting based on our intuition:
    - deltatime_id: The time difference between two bets by the same user
            We assume that putting three bids in the same second could be suspicious
    - deltatime_auction: The time difference between a new bet and the last one put in the same auction
            Also, it looks suscpicious if an account is always able to put a new bet nearly instantly after someone betted
    - nbids: The number of bids made by the same account
            This will be an important factor. Having a high number of bids does not necessarly imply that it is a bot. Nonetheless we think this factor could help us determine some susipicious behaviours
    - bids_per_auction: The number of bids made by the same user per auction.
    - mean_deltatime_id: The average time length between 2 bids for the same user.
    - mean_deltatime_auction_id: The average time length between 2 bids for the same user for the same auction.
    - min_deltatime_per_auction_per_id: The minimum time difference between bids per user for each auction.
    - number_ip_per_id: The number of ips used by each user.
    - number_url_per_idL the number of urls used by each user.

In [7]:
bids.bidder_id.nunique()

6614

In [8]:
bids["deltatime_id"]      = bids.groupby("bidder_id")['time'].diff()
bids['nbids']             = bids.groupby('bidder_id')['bidder_id'].transform('count')
bids['mean_deltatime_id'] = bids['deltatime_id']/bids['nbids']
bids['number_ip_id']      = bids.groupby(['bidder_id'])["ip"].transform('nunique')
bids['number_url_id']     = bids.groupby(['bidder_id'])["url"].transform('nunique')
bids['time_range']        = bids.groupby(['bidder_id'])['time'].max() - bids.groupby(['bidder_id'])['time'].min()
bids.shape

(7656334, 14)

The bids_id dataframe contains aggregrate values which might be useful for the model.
    
    -nbids: represents the number of bids per bidder id
    -mean_deltatime_id: represents the average time difference between two bids per bidder id.
    -number_ip_id: The number of ip address used by a single bidder id
    -number_url_id: The number of urls used by a single bidder id
    -time_range: The time between the first and last bids per bidder id
    -mean_deltatime_auction: the average duration between bids per auction and bidder.
    -mean_deltatime_auction_id: The average of the average bid length per bidder id
    -mean_nbids_auction_id: mean number of bids per auction per biider id.
    -entropy_ip: the ip entropy for each bidder 
    -entropy_url: the url entropy for each bidder

In [15]:
nbids             = pd.DataFrame(bids.groupby('bidder_id')['bidder_id'].count())
mean_deltatime_id = pd.DataFrame(bids.groupby('bidder_id')['deltatime_id'].mean())
number_ip_id      = pd.DataFrame(bids.groupby(['bidder_id'])["ip"].nunique())
number_url_id     = pd.DataFrame(bids.groupby(['bidder_id'])["url"].nunique())
time_range        = pd.DataFrame(bids.groupby('bidder_id')['time'].max() - bids.groupby('bidder_id')['time'].min())
min_delta_time    = pd.DataFrame(bids.loc[:,['bidder_id','deltatime_id']].groupby('bidder_id')['deltatime_id'].min()).rename(columns={"deltatime_id":"min_delta_time"})

bids_id           = nbids.join(mean_deltatime_id).join(number_ip_id).join(number_url_id).join(time_range).join(min_delta_time)
bids_id           = bids_id.rename(columns={"bidder_id": "nbids", 'deltatime_id' : "mean_deltatime_id", "ip" : "number_ip_id", "url":"number_url_id", "time": "time_range" }).reset_index(drop=False)

mean_deltatime_auction    = bids.groupby(["bidder_id",'auction'])["deltatime_id"].mean().reset_index(drop=False)
mean_deltatime_auction_id = mean_deltatime_auction.groupby(["bidder_id"]).mean().reset_index(drop=False)
nbids_auction             = pd.DataFrame(bids.groupby(["bidder_id",'auction'])["auction"].count())
mean_nbids_auction_id     = nbids_auction.groupby(["bidder_id"]).mean().rename(columns={"auction":"mean_nbids_auction_id"})

bids_id                   = bids_id.set_index('bidder_id').join(mean_deltatime_auction_id.set_index('bidder_id')).join(mean_nbids_auction_id)
bids_id                   = bids_id.rename(columns={"deltatime_id": "mean_deltatime_auction_id"})

This section is used to compute the ip entropy per user. The ip entropy is a measure used to evaluate the randomness of a variable, as well as the number of bids and the number of bids per ip. It has the following equation:
$$ IP, Entropy = \frac{N_{bids}!}{N_{ip,1}!N_{ip,2}!...N_{ip,n}!}$$
Where N represents the total number of bids done by the bidder, N_{ip,1} represents the number of bids done by the bidder on one ip address and so on.

The process of getting the ip entropy is the following:
    1. Compute the factorial of total number of bids per bidder id.
    2. Count the number of time the same ip was used by the user.
    3. Find the factorial of each of those numbers.
    4. Multiple those factorials for each bidder id.
    5. Join with the bid_id DataFrame.
    6. Finally divide step 1 by step 4 and take the log10 of that result.
    

THIS STEP IS TAKING SOME TIME TO RUN

In [None]:
bids_id['entropy_ip'] = bids_id['nbids'].apply(math.factorial)
temp                  = pd.DataFrame(bids.groupby(['bidder_id','ip'])['ip'].count()).rename(columns ={'ip': 'nb_per_ip'}).reset_index(drop=False)
temp                  = temp.groupby(['bidder_id','ip'])['nb_per_ip'].apply(math.factorial).reset_index(drop=False)
temp                  = temp.groupby('bidder_id')['nb_per_ip'].prod().reset_index(drop=False)
bids_id               = bids_id.join(temp.set_index('bidder_id'))
bids_id['entropy_ip'] = bids_id['entropy_ip']//bids_id['nb_per_ip']
bids_id.head()


The same was done to obtain the URL entropy. The same steps as those to obtain the ip entropy were used. We have therefore the following formula:
$$ URL, Entropy = \frac{N_{bids}!}{N_{url,1}!N_{url,2}!...N_{url,n}!}$$ 

In [None]:
bids_id['entropy_url'] = bids_id['nbids'].apply(math.factorial)
temp                   = pd.DataFrame(bids.groupby(['bidder_id','url'])['url'].count()).rename(columns ={'url': 'nb_per_url'}).reset_index(drop=False)
temp                   = temp.groupby(['bidder_id','url'])['nb_per_url'].apply(math.factorial).reset_index(drop=False)
temp                   = temp.groupby('bidder_id')['nb_per_url'].prod().reset_index(drop=False)
bids_id                = bids_id.join(temp.set_index('bidder_id'))
bids_id['entropy_url'] = bids_id['entropy_url']//bids_id['nb_per_url']
bids_id.head()

In [None]:
#creation of our ratio, lets assume that the time is in miliseconds, a per day ratio seems understandable
bids['nb_bets_per_time_range'] = bids.nbids/bids.time_range*86400000
bids.head(10)

We have one huge data set where we will have to understand and use informations (bids) and another one where we have to predict based on the user key. 

Lets first add the outcome in the bids dataframe in order to be able to recognize which account are corrupted just by looking at the bids table. 
We are expecting a lot of NaN since all accounts in "bids" are not all idifiented as clear or bots

In [None]:
#seems optional and dispensable at this point
train_prejoin = pd.DataFrame(train.loc[:,['bidder_id','outcome']])
bids1 = bids.join(train_prejoin.set_index('bidder_id'),on='bidder_id')
print(bids1.shape)
bids1.head(10)





Getting the data created thanks to the bids dataframe joined in the accounts data frame. This is where we are going to apply our algorithm



In [None]:
bids_prejoin = bids_id.loc[:,["nbids", "mean_deltatime_id", "time_range", "deltatime_id","entropy_ip","entropy_url","min_delta_time","mean_deltatime_auction_id","mean_nbids_auction_id"]]
trainjoin = train.set_index('bidder_id').join(bids_prejoin)
trainjoin.head(10)


In [None]:
trainjoin = trainjoin.replace([np.inf, -np.inf], np.nan)
trainjoin = trainjoin.dropna(subset=['nbids','min_delta_time','mean_deltatime_auction_id'])
trainjoin.head(10)

#  2- Graphic representation of the Results:

In [None]:
plt.figure()
plt.subplot(1,2,1)
plt.hist([trainjoin.nbids[trainjoin["outcome"]==1].apply(lambda x: math.log10(x))],bins =30, color = ['red'])
plt.xlabel('logarithm of number of bids')
plt.ylabel('number of users')
plt.title('Accounts identified as Bots')
plt.subplot(1,2,2)
plt.hist([trainjoin.nbids[trainjoin["outcome"]==0].apply(lambda x: math.log10(x))],bins=30, color = ['blue'])
plt.xlabel('logarithm of number of bids')
plt.ylabel('number of users')
plt.title('Accounts not identified as Bots')
plt.show()

In [None]:
plt.figure()
plt.subplot(1,2,1)
plt.hist([trainjoin.mean_nbids_auction_id[trainjoin["outcome"]==1].apply(lambda x: math.log10(x))],bins =30, color = ['red'])
plt.xlabel('logarithm of number of bids')
plt.ylabel('number of users')
plt.title('Accounts identified as Bots')
plt.subplot(1,2,2)
plt.hist([trainjoin.mean_nbids_auction_id[trainjoin["outcome"]==0].apply(lambda x: math.log10(x))],bins=30, color = ['blue'])
plt.xlabel('logarithm of number of bids')
plt.ylabel('number of users')
plt.title('Accounts not identified as Bots')
plt.show()

In [None]:
plt.figure()
plt.subplot(1,2,1)
plt.hist([trainjoin.entropy_ip[trainjoin["outcome"]==1].apply(lambda x: math.log10(x))],bins =30, color = ['red'])
plt.xlabel('entropy of ip, log based')
plt.ylabel('frequency')
plt.title('entropy of ip, log based')
plt.subplot(1,2,2)
plt.hist([trainjoin.entropy_ip[trainjoin["outcome"]==0].apply(lambda x: math.log10(x))],bins=30, color = ['blue'])
plt.xlabel('entropy of ip, log based')
plt.ylabel('frequency')
plt.title('Accounts not identified as Bots')
plt.show()

In [None]:
plt.figure()
plt.subplot(1,2,1)
plt.hist([trainjoin.entropy_url[trainjoin["outcome"]==1].apply(lambda x: math.log10(x))],bins =30, color = ['red'],label = ['bot'])
plt.xlabel('entropy of url, log based')
plt.ylabel('frequency')
plt.title('entropy of url, log based')
plt.subplot(1,2,2)
plt.hist([trainjoin.entropy_url[trainjoin["outcome"]==0].apply(lambda x: math.log10(x))],bins=30, color = ['blue'],label = ['not a bot'])
plt.xlabel('entropy of url, log based')
plt.ylabel('frequency')
plt.title('Accounts not identified as Bots')
plt.show()

In [None]:
filter_zero = trainjoin[trainjoin["min_delta_time"]!=0]

plt.figure()
plt.subplot(1,2,1)
plt.hist([filter_zero.min_delta_time[filter_zero["outcome"]==1].apply(lambda x: math.log10(x))],bins =30, color = ['red'],label = ['bot'])
plt.xlabel('minimimun time between bids, log based')
plt.ylabel('number of users')
plt.title('Accounts identified as Bots')
plt.subplot(1,2,2)
plt.hist([filter_zero.min_delta_time[filter_zero["outcome"]==0].apply(lambda x: math.log10(x))],bins=30, color = ['blue'],label = ['not a bot'])
plt.xlabel('minimimun time between bids, log based')
plt.ylabel('number of users')
plt.title('Accounts not identified as Bots')
plt.show()

In [None]:
plt.figure()
plt.subplot(1,2,1)
plt.hist([trainjoin.mean_deltatime_auction_id[trainjoin["outcome"]==1].apply(lambda x: math.log10(x))],bins =30, color = ['red'],label = ['bot'])
plt.xlabel('logarithm of number of bids')
plt.ylabel('number of users')
plt.title('Accounts identified as Bots')
plt.subplot(1,2,2)
plt.hist([trainjoin.mean_deltatime_auction_id[trainjoin["outcome"]==0].apply(lambda x: math.log10(x))],bins=30, color = ['blue'],label = ['not a bot'])
plt.xlabel('logarithm of number of bids')
plt.ylabel('number of users')
plt.title('Accounts not identified as Bots')
plt.show()

# 3- Model Selection:

This section is focused on selecting the correct model in order to predict whether or not a bidder_id is a bot. We focused on choosing an XGboost program using sklearn. 

In [None]:

trainjoin.fillna(-1,inplace = True)
print(trainjoin.isna().sum())
trainjoin.head(10)

In [None]:
X = trainjoin.iloc[:,4:12]
Y = trainjoin.iloc[:,3]

seed = 77
test_size = 0.2
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

Defining early stopping value

In [32]:
clf = XGBClassifier()
eval_set  = [(X_train,y_train), (X_test,y_test)]
clf.fit(X_train, y_train, eval_set=eval_set,
        eval_metric="auc", early_stopping_rounds=30)

ValueError: DataFrame.dtypes for data must be int, float or bool.
                Did not expect the data types in fields entropy_ip, entropy_url

In [None]:
model = XGBClassifier(early_stopping = 2)
model.fit(X_train, y_train)
print(model)

In [None]:
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]


Conf_matrix= confusion_matrix(y_test,y_pred)
TN, FP, FN, TP = Conf_matrix[0,0],Conf_matrix[0,1],Conf_matrix[1,0],Conf_matrix[1,1]
TNR, TPR, PPV, NPV = TN/(TN+FP), TP/(FN + TP), TP/(FP+TP),TN/(TN+FN)
print(Conf_matrix)
print('True Negative Rate :', TNR, '\n'
      'True Positive Rate :', TPR, '\n'
      'Positive Predictive Value :', PPV, '\n' 
      'Negative Predictive Value :' ,NPV )


accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Cross_Validation

In [None]:

kfold = KFold(n_splits=5, random_state=77)
results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

Preparation of the Data for Kaggle

In [None]:
test = pd.read_csv("Downloads/test.csv")
print(test.shape)
test.head(10)



In [None]:
bids_prejoin = bids.loc[:,['bidder_id','nbids','time_range','nb_bets/time_range']]

testjoin = pd.merge(test,bids_prejoin,
                 on='bidder_id',
                   how = 'left')

testjoin.drop_duplicates(inplace=True)

print(testjoin.shape)
testjoin.head(10)

In [None]:
testjoin.dropna(subset = ['bidder_id'], inplace = True)

testjoin.head(10)
print(testjoin.shape)

In [None]:
testjoin.fillna(-1,inplace = True)
print(testjoin.isna().sum())
testjoin.head(10)

In [None]:
X_final = testjoin.iloc[:,3:6]

y_final = model.predict(X_final)

print(len(y_final))
print(y_final)
predictions = [round(value) for value in y_pred]
print(len(predictions))

test['outcome'] = y_final

In [None]:
test.head(10)

In [None]:
test2 = test
test2.to_csv('Downloads/test2.csv',sep = ',')