## Partition Ad Click Data by Day

The goal of this EMR notebook is to sample ad click data by day so that we can illustrate the importance of the feedback loop in the machine learning life cycle.

In [1]:
%%info

### Define variables and import modules

In [1]:
s3_bucket = 'ai-in-aws/Click-Fraud'
train_prefix = 'train'
fileName = 'train.csv'


Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
0,application_1555722717507_0115,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


In [23]:
#Import modules

import os
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType
from pyspark.sql.window import Window 
from pyspark.ml.feature import * 
#import matplotlib.pyplot as plt


### Read Ad Click Data

In [9]:
#Read the compressed training dataset containing details on every click 
# "sep" is the Spark CSV option name for the field delimiter
ad_track_df = spark.read.option("header", "true").option("sep", ",").csv('s3://' + os.path.join(s3_bucket, fileName))


In [4]:
ad_track_df.show()


+------+---+------+---+-------+-------------------+---------------+-------------+
|    ip|app|device| os|channel|         click_time|attributed_time|is_attributed|
+------+---+------+---+-------+-------------------+---------------+-------------+
| 83230|  3|     1| 13|    379|2017-11-06 14:32:21|           null|            0|
| 17357|  3|     1| 19|    379|2017-11-06 14:33:34|           null|            0|
| 35810|  3|     1| 13|    379|2017-11-06 14:34:12|           null|            0|
| 45745| 14|     1| 13|    478|2017-11-06 14:34:52|           null|            0|
|161007|  3|     1| 13|    379|2017-11-06 14:35:08|           null|            0|
| 18787|  3|     1| 16|    379|2017-11-06 14:36:26|           null|            0|
|103022|  3|     1| 23|    379|2017-11-06 14:37:44|           null|            0|
|114221|  3|     1| 19|    379|2017-11-06 14:37:59|           null|            0|
|165970|  3|     1| 13|    379|2017-11-06 14:38:10|           null|            0|
| 74544| 64|    

In [24]:
type(ad_track_df) # sql dataframe


<class 'pyspark.sql.dataframe.DataFrame'>

### Partition Ad Click dataset by day

In [10]:
# Filter sql dataframe by date - days 1, 2, 3, and 4
ad_track_df = ad_track_df.select("*", F.col("click_time").cast("date").alias("click_date"))

# Mon, Tues, Wed, Thurs
days = ['2017-11-06', '2017-11-07', '2017-11-08', '2017-11-09']
dict_of_df = {}

for idx, dy in enumerate(days):
    key_name = 'ad_track_day' + str(idx+1)
    # No deepcopy is needed: filter() returns a new DataFrame and leaves ad_track_df unchanged
    dict_of_df[key_name] = ad_track_df.filter(ad_track_df.click_date == dy)
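
The weekday labels in the comment above can be checked with the standard library. This is a minimal sketch that runs without Spark; `weekday_name` and `WEEKDAYS` are hypothetical names introduced here for illustration and are not part of the notebook.

```python
from datetime import datetime

# Fixed English names so the result does not depend on the system locale.
WEEKDAYS = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
            'Friday', 'Saturday', 'Sunday']

def weekday_name(date_str):
    """Return the English weekday name for a 'YYYY-MM-DD' string."""
    # datetime.weekday() returns 0 for Monday through 6 for Sunday.
    return WEEKDAYS[datetime.strptime(date_str, '%Y-%m-%d').weekday()]

days = ['2017-11-06', '2017-11-07', '2017-11-08', '2017-11-09']
day_names = [weekday_name(d) for d in days]
print(day_names)
```

The four dates resolve to Monday through Thursday, which matches the labels used for the day 1 to day 4 partitions.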


In [11]:
dict_of_df.keys()


['ad_track_day1', 'ad_track_day3', 'ad_track_day2', 'ad_track_day4']

In [19]:
dict_of_df['ad_track_day1'].select([F.min('is_attributed'), F.max('is_attributed')]).show()


+------------------+------------------+
|min(is_attributed)|max(is_attributed)|
+------------------+------------------+
|                 0|                 1|
+------------------+------------------+

In [7]:
# Get unique value of click_date column
dict_of_df['ad_track_day1'].select('click_date').distinct().show()


+----------+
|click_date|
+----------+
|2017-11-06|
+----------+

In [20]:
#Keep at most num_rows rows per day; limit() is not a random sample
num_rows = 600000

for k in dict_of_df.keys():
    dict_of_df[k] = dict_of_df[k].limit(num_rows)
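
If a random subset is preferred over the rows that `limit` happens to return, one option is `DataFrame.sample`, which takes a fraction rather than a row count. The sketch below only computes that fraction; `sample_fraction` is a hypothetical helper, and the per-day totals are an assumption (they would have to come from a `count()` on each day's DataFrame, which the notebook does not run).

```python
def sample_fraction(target_rows, total_rows):
    """Fraction to pass to DataFrame.sample to get roughly target_rows rows."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    # Cap at 1.0: without replacement we cannot sample more rows than exist.
    return min(1.0, float(target_rows) / total_rows)

# Hypothetical usage inside the notebook (not executed here):
#   total = dict_of_df[k].count()
#   frac = sample_fraction(num_rows, total)
#   dict_of_df[k] = dict_of_df[k].sample(False, frac, seed=42)

print(sample_fraction(1, 4))
```

Note that `sample` returns approximately, not exactly, the requested share of rows, so the resulting row counts would vary slightly from day to day.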


In [21]:
#view the distribution of is_attributed

for k in dict_of_df.keys():
    dict_of_df[k].groupBy(F.col('is_attributed')).count().show()


+-------------+------+
|is_attributed| count|
+-------------+------+
|            0|598948|
|            1|  1052|
+-------------+------+

+-------------+------+
|is_attributed| count|
+-------------+------+
|            0|598934|
|            1|  1066|
+-------------+------+

+-------------+------+
|is_attributed| count|
+-------------+------+
|            0|598885|
|            1|  1115|
+-------------+------+

+-------------+------+
|is_attributed| count|
+-------------+------+
|            0|598909|
|            1|  1091|
+-------------+------+

The data is heavily imbalanced: only about 0.18% of the sampled clicks (roughly 1,050 to 1,115 out of 600,000 per day) result in app downloads.
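
The share of attributed clicks follows directly from the counts printed above. The sketch below recomputes it in plain Python; the tuples are copied from the four output tables in the order they were printed, and `positive_rate_pct` is a hypothetical helper name.

```python
# (is_attributed == 0, is_attributed == 1) counts, in printed order.
counts = [(598948, 1052), (598934, 1066), (598885, 1115), (598909, 1091)]

def positive_rate_pct(negatives, positives):
    """Percentage of clicks that were attributed (led to a download)."""
    return 100.0 * positives / (negatives + positives)

rates = [positive_rate_pct(neg, pos) for neg, pos in counts]
for rate in rates:
    print(round(rate, 3))
```

Every partition lands between 0.17% and 0.19%, so a model that always predicts "no download" would already be more than 99.8% accurate on this sample.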

In [22]:
# Write the data in parquet format and save to S3 bucket
for k in dict_of_df.keys():
    dict_of_df[k].write.parquet('s3://' + s3_bucket + '/' + train_prefix + "/" + k + ".parquet", mode='overwrite')
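
To make the output locations explicit, the path concatenation in the loop above can be reproduced without Spark. This is a minimal sketch; `parquet_path` is a hypothetical helper, and `posixpath.join` is used here in place of string concatenation so that the separators are always forward slashes.

```python
import posixpath

s3_bucket = 'ai-in-aws/Click-Fraud'
train_prefix = 'train'

def parquet_path(key_name):
    """Build the S3 location that one day's partition is written to."""
    return 's3://' + posixpath.join(s3_bucket, train_prefix, key_name + '.parquet')

paths = [parquet_path('ad_track_day' + str(i)) for i in range(1, 5)]
print(paths[0])
# s3://ai-in-aws/Click-Fraud/train/ad_track_day1.parquet
```

Each location is a directory of part files rather than a single file, since Spark writes one file per partition of the DataFrame.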

