# Fraudulent click detector - Report

In this notebook, we analyse the files produced by our detectors.

## Imported libraries

In [9]:
import json
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd

## Utility functions

In [133]:
class Event():
    topic = ""
    uid = ""
    ip = ""
    timestamp = 0
    impressionId = ""
    
    def __init__(self, jsline):
        obj = json.loads(jsline)
        self.topic = obj['eventType']
        self.uid = obj['uid']
        self.ip = obj['ip']
        self.timestamp = obj['timestamp']
        self.impressionId = obj['impressionId']

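def create_df_from_file(filename):
    """Parse a detector output file into a DataFrame, one row per event."""
    records = []
    with open(filename) as f:
        for line in f:
            # Drop the 6-character prefix and the 2 trailing characters, then
            # turn the key='value' pairs into JSON-style key:"value" pairs.
            line = line[6:-2].replace(' ', '').replace("'", '"').replace('=', ':')
            elements = []
            for e in line.split(','):
                # Quote the key, i.e. the text before the first colon.
                idx = e.index(':')
                elements.append('"' + e[:idx] + '"' + e[idx:])
            line = '{' + ','.join(elements) + '}'
            records.append(json.loads(line))
    return pd.DataFrame(records)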

def length_of_df(df):
    return len(df.index)

def get_ctr(df):
    n_displays = len(df[df.eventType=='display'].index)
    n_clicks = len(df[df.eventType=='click'].index)
    ctr = n_clicks / float(n_displays)
    print("The click-through-rate equals {:.1f}%.".format(100*ctr))
    return ctr

def get_ctr_after_fraud(df, invalid_df):
    n_displays = len(df[df.eventType=='display'].index)
    n_clicks = len(df[df.eventType=='click'].index)
    n_invalid = len(invalid_df.index)
    print("{:d} clicks have been detected as fraudulent out of {:d} clicks.".format(n_invalid, n_clicks))
    ctr = (n_clicks - n_invalid) / float(n_displays)
    print("The click-through-rate equals {:.1f}%.".format(100*ctr))
    return ctr
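
The line parsing done inside `create_df_from_file` can be tried on a single line. The sketch below repeats the same string transformations in a standalone helper. The helper name `parse_event_line` and the sample line are hypothetical: the raw file layout is not shown in this notebook, so the `Event(...)` wrapper is only an assumption that matches the `[6:-2]` slice.

```python
import json

def parse_event_line(line):
    """Convert one raw line into a dict, mirroring create_df_from_file."""
    # Drop the 6-character prefix and the 2 trailing characters
    # (assumed here to be "Event(" and ")" plus the newline).
    body = line[6:-2].replace(' ', '').replace("'", '"').replace('=', ':')
    pairs = []
    for field in body.split(','):
        idx = field.index(':')  # the first colon separates key from value
        pairs.append('"' + field[:idx] + '"' + field[idx:])
    return json.loads('{' + ','.join(pairs) + '}')

# Hypothetical input line, made up for illustration only.
sample = "Event(eventType='click', uid='u1', timestamp=1589, ip='10.0.0.1', impressionId='imp1')\n"
event = parse_event_line(sample)
print(event)
```

Because the approach splits on commas and removes every space, it only works while field values contain no spaces, commas or `=` signs.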

In [122]:
global_filename = "initial_stream.txt"
pattern1_filename = "pattern_1.txt"
pattern2_filename = "pattern_2.txt"
pattern3_filename = "pattern_3.txt"

initial_df = create_df_from_file(global_filename)
pattern1_df = create_df_from_file(pattern1_filename)
pattern2_df = create_df_from_file(pattern2_filename)
pattern3_df = create_df_from_file(pattern3_filename)

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
fraudulent_df = pd.concat([pattern1_df, pattern2_df, pattern3_df]).drop_duplicates()

## Click-through-rate

We first compute the CTR of the initial stream and get a rate of about 34%.
Then we compute the CTR of the stream with all suspicious clicks removed; the resulting rate is reported in the output below.

In [131]:
get_ctr(initial_df)

The click-through-rate equals 34.3%.


0.34277238403452

In [132]:
get_ctr_after_fraud(initial_df, fraudulent_df)

1161 clicks have been detected as fraudulent out of 1271 clicks.
The click-through-rate equals 3.0%.


0.0296655879180151
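
The two rates can be cross-checked by hand. This is a small sketch using only the numbers printed above. The display count is not printed anywhere in the notebook, so `n_displays` is inferred here from the click count and the initial CTR.

```python
n_clicks = 1271                 # total clicks, from the output above
n_invalid = 1161                # clicks flagged as fraudulent, from the output above
ctr_before = 0.34277238403452   # value returned by get_ctr

# Inferred, not printed in the notebook: displays = clicks / CTR.
n_displays = round(n_clicks / ctr_before)

# Same formula as get_ctr_after_fraud.
ctr_after = (n_clicks - n_invalid) / n_displays
# Share of all clicks that were flagged.
fraud_share = n_invalid / n_clicks

print(n_displays)
print("{:.1f}%".format(100 * ctr_after))
print("{:.1f}%".format(100 * fraud_share))
```

Under this reading, roughly nine clicks out of ten in the stream were flagged, which explains the drop from about 34% to about 3%.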

## Pattern detectors

In [134]:
print("The pattern that detects clicks without corresponding displays flagged {:d} clicks."
     .format(length_of_df(pattern1_df)))
print("The pattern that detects clicks that occurred too fast flagged {:d} clicks."
     .format(length_of_df(pattern2_df)))
print("The pattern that detects clicks from hyperactive IP addresses flagged {:d} clicks."
     .format(length_of_df(pattern3_df)))

The pattern that detects clicks without corresponding displays flagged 1110 clicks.
The pattern that detects clicks that occurred too fast flagged 80 clicks.
The pattern that detects clicks from hyperactive IP addresses flagged 521 clicks.
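
The three counts add up to more than the 1161 fraudulent clicks reported in the CTR section, because `fraudulent_df` is deduplicated. A short sketch of the arithmetic, using only the printed numbers (the dictionary keys are illustrative labels, not names from the detectors):

```python
# Counts printed above, one entry per pattern.
flagged = {"no_display": 1110, "too_fast": 80, "hyperactive_ip": 521}
# Rows left in fraudulent_df after drop_duplicates, from the CTR section.
n_unique = 1161

total_flags = sum(flagged.values())
# Rows removed by drop_duplicates: a click flagged by k patterns counts k - 1 times here.
n_duplicates = total_flags - n_unique

print(total_flags, n_duplicates)
```

The difference most likely comes from clicks caught by more than one pattern, although `drop_duplicates` would also remove rows repeated within a single pattern file.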
