### Feature Extraction

The feature set we used in this work consists of 13 categoric, 17 numeric at total 30 different variables. Categoric ones are related to gender of favorite brand/category/product, frequently interacted daytime and weekday of each customers. Numeric ones are the basic action counts, action counts on spesific category, brand or business unit, price related features and session time. We paid attention to naming the colums so that they are self-explanatory which is as follows:

- brand_gender 
- cat_gender 
- fav_product_gender 
- wDay_basket 
- wDay_favorite 
- wDay_order 
- wDay_search 
- wDay_visit 
- daytime_basket 
- daytime_favorite 
- daytime_order 
- daytime_search 
- daytime_visit 
- fav_avg_price 
- order_avg_price 
- basket_avg_price 
- count_basket 
- count_favorite 
- count_order 
- count_search 
- count_visit 
- female_category_action_count 
- female_brand_action_count 
- male_brand_action_count 
- avg_price 
- female_businessunit_action_count 
- male_businessunit_action_count 
- female_content_action_count 
- male_content_action_count 
- avg_session_time

In [1]:
require(data.table)
require(caret)
require(lubridate)

Loading required package: data.table

"package 'data.table' was built under R version 3.6.3"
Loading required package: caret

"package 'caret' was built under R version 3.6.3"
Loading required package: lattice

"package 'lattice' was built under R version 3.6.3"
Loading required package: ggplot2

Registered S3 methods overwritten by 'tibble':
  method     from  
  format.tbl pillar
  print.tbl  pillar

Loading required package: lubridate

"package 'lubridate' was built under R version 3.6.3"

Attaching package: 'lubridate'


The following objects are masked from 'package:data.table':

    hour, isoweek, mday, minute, month, quarter, second, wday, week,
    yday, year


The following objects are masked from 'package:base':

    date, intersect, setdiff, union




In [5]:
current_folder = getwd()

train = fread('model-data/raw-data/train.csv', encoding = "UTF-8")
test = fread('model-data/raw-data/test.csv', encoding = "UTF-8")
test_id = fread('test_ids_in_prediction.csv', encoding = "UTF-8")
response = unique(train[,c("unique_id","gender")])

# we have identified duplicate records both in train and test sets so it's needed to take unique ones
train = unique(train)
test = unique(test)

In [3]:
head(train)

time_stamp,contentid,user_action,sellingprice,product_name,brand_id,brand_name,businessunit,product_gender,category_id,Level1_Category_Id,Level1_Category_Name,Level2_Category_Id,Level2_Category_Name,Level3_Category_Id,Level3_Category_Name,gender,unique_id,type
2020-12-02T22:26:14.023Z,39918893,favorite,3099.0,PerfectCare 600 EW6F449ST A+++ 9 KG 1400 Devir Çamasir Makinesi,8511,Electrolux,Beyaz Esya,Unisex,1272,1071,Elektronik,1212,Beyaz Esya,1272,Çamasir Makinesi,F,425,train
2020-12-08T23:15:04.603Z,3558544,favorite,3079.0,WW90J5475FW A+++ 1400 Devir 9 kg Çamasir Makinesi,3228,Samsung,Beyaz Esya,,1272,1071,Elektronik,1212,Beyaz Esya,1272,Çamasir Makinesi,F,425,train
2020-12-05T16:19:01.157Z,31292729,favorite,3999.0,KM 9711 A++ 9 kg Çamasir Kurutma Makinesi,10989,Vestel,Beyaz Esya,Unisex,1276,1071,Elektronik,1212,Beyaz Esya,1276,Kurutma Makinesi,F,425,train
2020-12-05T16:28:00Z,6363103,visit,2544.0,CMI 9710 A+++ 1000 Devir 9 kg Çamasir Makinesi,10989,Vestel,Beyaz Esya,,1272,1071,Elektronik,1212,Beyaz Esya,1272,Çamasir Makinesi,F,425,train
2020-12-02T22:26:59Z,39918893,visit,3099.0,PerfectCare 600 EW6F449ST A+++ 9 KG 1400 Devir Çamasir Makinesi,8511,Electrolux,Beyaz Esya,Unisex,1272,1071,Elektronik,1212,Beyaz Esya,1272,Çamasir Makinesi,F,425,train
2020-11-03T21:04:11Z,32593071,visit,266.65,Siyah Kadin Abiye Ayakkabi 01AYH158420A100,59,Hotiç,Branded Shoes A,Kadin,431,403,Ayakkabi,430,Topuklu Ayakkabi,431,Abiye Ayakkabi,F,425,train


#### 1. Favorite Brand Gender based on Actions

In [7]:
# count different action types on brands for each customer
# sort them in decreasing order 
# select the brands which have most frequent action counts for each action type 
fav_brand = train[,.N,c("unique_id","user_action","brand_id")][order(user_action, -N)][, head(.SD, 1), by = c("unique_id","user_action")]

In [8]:
# an example of one customer
# 21663 is the most frequent brand for 4 action types 
# which means customer-425 visited brand-21663 22 times and it is the highest count among the all visits of customer-425
fav_brand[unique_id==425]

unique_id,user_action,brand_id,N
425,basket,21663,2
425,favorite,21663,11
425,order,1423,1
425,search,21663,12
425,visit,21663,22


In [9]:
# count occurance of favorite brand for each action type
# sort them in decreasing order 
# select the brand which is mostly occurred as favorite 
fav_brand_by_action = fav_brand[,.N,c("unique_id","brand_id")][order(-N)][, head(.SD, 1), by = c("unique_id")]

In [10]:
# since 21663 occurred 4 times as mostly actioned brand among the action types, it is selected as favorite brand of customer 424
# this algorithm is used because we didn't want to prioritize the action types
# we cannot say that the favorite brand of 425 is 1423 because he/she ordered just for 1 time
fav_brand_by_action[unique_id==425]

unique_id,brand_id,N
425,21663,4


In [15]:
fav_brand_by_action = fav_brand_by_action[,1:2]
setnames(fav_brand_by_action, "brand_id", "fav_brand")

In [6]:
fav_brand_by_action[unique_id==425]

unique_id,fav_brand
425,21663


In [11]:
# count product genders under each brand
# sort them in decreasing order 
# select the gender which is most frequent
brand_gender = train[,.N,by=c("brand_id", "brand_name", "product_gender")][order(brand_id, -N)][, head(.SD, 1), by = c("brand_id")]

In [8]:
# an example of one brand
brand_gender[brand_name=="Hotiç"]

brand_id,brand_name,product_gender,N
59,Hotiç,Kadin,3106


In [12]:
brand_gender = brand_gender[,c("brand_id", "product_gender")]
setnames(brand_gender, "product_gender", "brand_gender")

In [13]:
brand_gender[brand_id=="59"]

brand_id,brand_gender
59,Kadin


In [16]:
fav_brand_by_action = merge(fav_brand_by_action, brand_gender, by.x="fav_brand", by.y="brand_id", all.x=T)
fav_brand_by_action[,brand_gender:=ifelse(brand_gender==''|is.na(fav_brand_by_action$brand_gender), "Unisex", brand_gender)]

In [17]:
fav_brand_by_action[fav_brand==59]

fav_brand,unique_id,brand_gender
59,3674,Kadin
59,5867,Kadin
59,5478,Kadin
59,4713,Kadin
59,6577,Kadin
59,2714,Kadin
59,2576,Kadin
59,2416,Kadin
59,6995,Kadin


#### 2. Favorite Category Gender based on Actions

In [18]:
# first we concat the Level1_Category_Id and Level2_Category_Id in order to use the hierachy information
train$concat_Category_Id = paste0(train$Level1_Category_Id, train$Level2_Category_Id)

In [19]:
# count different action types on categories for each customer
# sort them in decreasing order 
# select the category which have most frequent action counts for each action type 
fav_cat = train[,.N,c("unique_id","user_action","concat_Category_Id")][order(user_action, -N)][, head(.SD, 1), by = c("unique_id","user_action")]

In [21]:
# an example of one customer
# 5222871 is the most frequent category for all 5 action types 
# which means customer-1511 visited category-5222871 1433 times and it is the highest count among the all visits of customer-1511
fav_cat[unique_id==1511]

unique_id,user_action,concat_Category_Id,N
1511,basket,5222871,197
1511,favorite,5222871,241
1511,order,5222871,11
1511,search,5222871,332
1511,visit,5222871,1433


In [35]:
# count occurance of favorite category for each action type
# sort them in decreasing order 
# select the brand which is mostly occurred as favorite 
fav_cat_by_action = fav_cat[,.N,c("unique_id","concat_Category_Id")][order(-N)][, head(.SD, 1), by = c("unique_id")]

In [25]:
fav_cat_by_action[unique_id==1511]

unique_id,concat_Category_Id,N
1511,5222871,5


In [36]:
fav_cat_by_action = fav_cat_by_action[,1:2]
setnames(fav_cat_by_action, "concat_Category_Id", "fav_category")

In [32]:
# count product genders under each category
# sort them in decreasing order 
# select the gender which is most frequent
cat_gender = train[,.N,by=c("concat_Category_Id", "product_gender")][order(concat_Category_Id, -N)][, head(.SD, 1), by = c("concat_Category_Id")]

In [33]:
# an example of one category
cat_gender[concat_Category_Id=="5222871"]

concat_Category_Id,product_gender,N
5222871,Kadin,277218


In [34]:
cat_gender = cat_gender[,c("concat_Category_Id", "product_gender")]
setnames(cat_gender, "product_gender", "cat_gender")

In [29]:
cat_gender[concat_Category_Id=="5222871"]

concat_Category_Id,cat_gender
5222871,Kadin


In [37]:
fav_cat_by_action = merge(fav_cat_by_action, cat_gender, by.x="fav_category", by.y="concat_Category_Id", all.x=T)
fav_cat_by_action[,cat_gender:=ifelse(cat_gender==''|is.na(fav_cat_by_action$cat_gender), "Unisex", cat_gender)]

In [38]:
fav_cat_by_action[fav_category=="5222871"&unique_id==1511]

fav_category,unique_id,cat_gender
5222871,1511,Kadin


#### 3. Favorite Product Gender based on Actions

In [39]:
# count different action types on product genders for each customer
# sort them in decreasing order 
# select the product gender which have most frequent action counts for each action type 
fav_product_gender = train[,.N,c("unique_id","user_action","product_gender")][order(user_action, -N)][, head(.SD, 1), by = c("unique_id","user_action")]

In [40]:
# an example of one customer
# Products with labeled as "Kadin" is the most frequent category for all 5 action types 
# which means customer-319 visited "Kadın" products 2968 times and it is the highest count among the all visits of customer-319
fav_product_gender[unique_id==319]

unique_id,user_action,product_gender,N
319,basket,Kadin,701
319,favorite,Kadin,649
319,order,Kadin,15
319,search,Kadin,2192
319,visit,Kadin,2968


In [41]:
# count occurance of favorite product gender for each action type
# sort them in decreasing order 
# select the product gender which is mostly occurred as favorite 
fav_product_gender_by_action = fav_product_gender[,.N,c("unique_id","product_gender")][order(-N)][, head(.SD, 1), by = c("unique_id")]

In [42]:
fav_product_gender_by_action = fav_product_gender_by_action[,1:2]
setnames(fav_product_gender_by_action, "product_gender", "fav_product_gender")

In [43]:
fav_product_gender_by_action[unique_id==319]

unique_id,fav_product_gender
319,Kadin


In [44]:
fav_product_gender_by_action[,fav_product_gender:=ifelse(fav_product_gender==''|is.na(fav_product_gender_by_action$fav_product_gender), "Unisex", fav_product_gender)]

In [45]:
fav_product_gender_by_action[unique_id==319]

unique_id,fav_product_gender
319,Kadin


#### 4. Occurance Days of Actions

In [33]:
# create date column from time_stamp in date type
train$date = as.Date(train$time_stamp)
weekdays1 = c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday')
# create a column wDay which calculates whether a date is in weekday or weekend
train$wDay = c('weekend', 'weekday')[(weekdays(train$date) %in% weekdays1)+1L]

In [34]:
# count different action types for each customer based on the days they happened
# sort them in decreasing order 
# select the wDay which have most frequent action counts for each action type 
actions_by_wDay = train[,.N, c("unique_id","user_action","wDay")][order(user_action, -N)][, head(.SD, 1), by = c("unique_id","user_action")]

In [39]:
# an example of one customer
# customer-425 does the following actions mostly at the weekend compared to weekdays
actions_by_wDay[unique_id==425]

unique_id,user_action,wDay,N
425,basket,weekend,10
425,favorite,weekend,44
425,order,weekend,4
425,search,weekend,55
425,visit,weekend,128


In [35]:
# change it to long format
actions_by_wDay_long = dcast(actions_by_wDay, unique_id~user_action, value.var='wDay')

In [36]:
# rename columns
colnames(actions_by_wDay_long)[2:ncol(actions_by_wDay_long)] = paste("wDay", colnames(actions_by_wDay_long)[2:ncol(actions_by_wDay_long)], sep = "_")

In [70]:
actions_by_wDay_long[unique_id==425]

unique_id,wDay_basket,wDay_favorite,wDay_order,wDay_search,wDay_visit
425,weekend,weekend,weekend,weekend,weekend


In [37]:
actions_by_wDay_long[, 2:6][is.na(actions_by_wDay_long[, 2:6])] = "Unknown"

#### 5. Occurance Times of Actions

In [38]:
# returns time interval of given timestamp 

nightday <- function(datetime) {
  paste(
    c("night", "morning", "afternoon", "evening", "night")[
      cut(as.numeric(format(datetime, "%H%M")), c(-0.01, 530, 1100, 1700 ,2000, 2359))
      ]
  )
}

In [39]:
train$timestamp = as.POSIXct(train$time_stamp, format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC")

In [40]:
train$daytime = nightday(train$timestamp)

In [41]:
# count different action types based on the time interval they happened
# sort them in decreasing order 
# select the time interval which have most frequent action counts for each action type 
actions_by_daytime = train[,.N, c("unique_id","user_action","daytime")][order(user_action, -N)][, head(.SD, 1), by = c("unique_id","user_action")]

In [8]:
# an example of one customer
# customer-19 does the following actions mostly at nights compared to other time intervals
actions_by_daytime[unique_id==19]

unique_id,user_action,daytime,N
19,basket,night,113
19,favorite,night,844
19,order,night,13
19,search,night,1918
19,visit,night,4567


In [42]:
actions_by_daytime_long = dcast(actions_by_daytime, unique_id~user_action, value.var='daytime')

In [43]:
colnames(actions_by_daytime_long)[2:ncol(actions_by_daytime_long)] = paste("daytime", colnames(actions_by_daytime_long)[2:ncol(actions_by_daytime_long)], sep = "_")

In [49]:
actions_by_daytime_long[unique_id==19]

unique_id,daytime_basket,daytime_favorite,daytime_order,daytime_search,daytime_visit
19,night,night,night,night,night


In [44]:
actions_by_daytime_long[, 2:6][is.na(actions_by_daytime_long[, 2:6])] = "Unknown"

#### 6. Monetary Value

In [129]:
# calculate average price of products favorited for each customer
avg_price_by_fav = train[user_action=="favorite", list(fav_avg_price=mean(sellingprice)) ,"unique_id"]

In [133]:
# calculate average price of products ordered for each customer
avg_price_by_order = train[user_action=="order", list(order_avg_price=mean(sellingprice)) ,"unique_id"]

In [135]:
# calculate average price of products put in basket for each customer
avg_price_by_basket = train[user_action=="basket", list(basket_avg_price=mean(sellingprice)),"unique_id"]

In [42]:
# calculate the average price of all contents that each customers showed interest
avg_price_by_unique_id = train[,list(avg_price=mean(sellingprice, na.rm=T)), "unique_id"]

In [43]:
head(avg_price_by_unique_id)

unique_id,avg_price
425,623.90854
3273,173.83914
183,227.30782
1983,255.01598
737,225.23344
4892,57.41814


#### 7. Basic Action Counts

In [46]:
# count all recorded actions for each customer
action_counts = train[,.N,c("unique_id","user_action")][order(user_action, -N)]

In [48]:
action_counts[unique_id==319]

unique_id,user_action,N
319,basket,732
319,favorite,679
319,order,30
319,search,2425
319,visit,3256


In [49]:
# change it to long format
action_counts_long = dcast(action_counts, unique_id~user_action, value.var='N')

In [50]:
action_counts_long[unique_id==319]

unique_id,basket,favorite,order,search,visit
319,732,679,30,2425,3256


In [51]:
colnames(action_counts_long)[2:ncol(action_counts_long)] = paste("count", colnames(action_counts_long)[2:ncol(action_counts_long)], sep = "_")

In [52]:
action_counts_long[unique_id==319]

unique_id,count_basket,count_favorite,count_order,count_search,count_visit
319,732,679,30,2425,3256


#### 8. Actions Counts based on Customer Profile

In [53]:
# calculate the total number of actions on Level3_Category by each gender
action_by_gender = train[,.N,c("Level3_Category_Name","gender")][order(Level3_Category_Name)]

In [54]:
action_by_gender_long = dcast(action_by_gender, Level3_Category_Name~gender, value.var='N')

In [55]:
setnafill(action_by_gender_long, type=c("const"), fill=0, cols=c("F", "M"))
# calculate the total actions
action_by_gender_long = action_by_gender_long[,total:=F+M]
# select total action count > 100 is our concern to decrease the noise
action_by_gender_long = action_by_gender_long[total>100,]

In [56]:
# calculate female percent on these actions
action_by_gender_long[ ,F_percent := F/total]
# calculate male percent on these actions
action_by_gender_long[ ,M_percent := M/total]
# flag the categories as female if most of the actions done by female
# 0.8 is selected as thresehold intutively by looking at category names 
action_by_gender_long[, female_categories:=ifelse(F_percent>=0.8,1,0)]

In [57]:
action_by_gender_long[F_percent<=1, ][order(-F_percent)]

Level3_Category_Name,F,M,total,F_percent,M_percent,female_categories
Hamile Gecelik,248,0,248,1.0000000,0.000000000,1
Lohusa Seti,456,0,456,1.0000000,0.000000000,1
Piercing,800,1,801,0.9987516,0.001248439,1
Hamile Pijama,492,1,493,0.9979716,0.002028398,1
Mama Önlügü,379,1,380,0.9973684,0.002631579,1
Sahmeran,345,1,346,0.9971098,0.002890173,1
Hamile Pantolonu,315,1,316,0.9968354,0.003164557,1
Peruk,256,1,257,0.9961089,0.003891051,1
Gögüs Pompasi,140,1,141,0.9929078,0.007092199,1
Yapiskanli Folyo,101,1,102,0.9901961,0.009803922,1


In [58]:
# shows the female dominant categories
action_by_gender_long[female_categories==1]$Level3_Category_Name

In [59]:
# count the number of actions of each customer on these categories
action_count_by_gender_category = train[Level3_Category_Name %in% action_by_gender_long[female_categories==1]$Level3_Category_Name,.N,c("unique_id")]

In [60]:
setnames(action_count_by_gender_category, "N", "female_category_action_count")

In [61]:
# an example
# customer-717 showed actions 1613 times in above categories 
# This may carry information about the customers gender
head(action_count_by_gender_category)

unique_id,female_category_action_count
425,254
3273,36
183,803
1983,760
737,1613
4892,33


In [62]:
# calculate the total number of actions on brand  by each gender
brand_by_gender = train[,.N,c("brand_name","gender")][order(brand_name)]

In [63]:
brand_by_gender_long = dcast(brand_by_gender, brand_name~gender, value.var='N')

In [64]:
setnafill(brand_by_gender_long, type=c("const"), fill=0, cols=c("F", "M"))
# calculate the total actions
brand_by_gender_long[,total:=F+M]
brand_by_gender_long = brand_by_gender_long[,total:=F+M]
# select total action count > 100 is our concern to decrease the noise
brand_by_gender_long = brand_by_gender_long[total>100,]

In [65]:
# calculate female percent on these actions
brand_by_gender_long[ ,F_percent := F/total]
# calculate male percent on these actions
brand_by_gender_long[ ,M_percent := M/total]
# flag the categories as female or male if most of the actions done by female
# threseholds are selected intutively by looking at category names
brand_by_gender_long[, female_brands:=ifelse(F_percent>=0.9,1,0)]
brand_by_gender_long[, male_brands:=ifelse(M_percent>=0.8,1,0)]

In [433]:
brand_by_gender_long[female_brands==1]$brand_name

In [66]:
# count the number of actions of each customer on these female brands
action_count_by_gender_fbrand = train[brand_name %in% brand_by_gender_long[female_brands==1]$brand_name,.N,c("unique_id")]

In [67]:
setnames(action_count_by_gender_fbrand, "N", "female_brand_action_count")

In [68]:
# an example
# customer-737 showed actions 986 times in female brands 
# This may carry information about the customers gender
head(action_count_by_gender_fbrand)

unique_id,female_brand_action_count
425,104
3273,2
183,261
1983,416
737,986
4892,3


In [69]:
# count the number of actions of each customer on these male brands
action_count_by_gender_mbrand = train[brand_name %in% brand_by_gender_long[male_brands==1]$brand_name,.N,c("unique_id")]

In [70]:
setnames(action_count_by_gender_mbrand, "N", "male_brand_action_count")

In [71]:
head(action_count_by_gender_mbrand)

unique_id,male_brand_action_count
3004,1
336,1
236,33
807,3
1058,6
328,7


In [72]:
# calculate the total number of actions on business unit by each gender
# same logic above is applied
businessunit_by_gender = train[,.N,c("businessunit","gender")][order(businessunit)]

In [73]:
businessunit_by_gender_long = dcast(businessunit_by_gender, businessunit~gender, value.var='N')

In [74]:
setnafill(businessunit_by_gender_long, type=c("const"), fill=0, cols=c("F", "M"))
businessunit_by_gender_long[,total:=F+M]
businessunit_by_gender_long = businessunit_by_gender_long[,total:=F+M]
businessunit_by_gender_long = businessunit_by_gender_long[total>100,]

In [75]:
businessunit_by_gender_long[ ,F_percent := F/total]
businessunit_by_gender_long[ ,M_percent := M/total]
businessunit_by_gender_long[, female_bu:=ifelse(F_percent>=0.85,1,0)]
businessunit_by_gender_long[, male_bu:=ifelse(M_percent>=0.5,1,0)]

In [7]:
businessunit_by_gender_long[F_percent<=1, ][order(-F_percent)]

businessunit,F,M,total,F_percent,M_percent,female_bu,male_bu
PL Woman,194967,6604,201571,0.9672374,0.03276265,1,0
Makyaj,32062,1114,33176,0.9664215,0.03357849,1,0
PL Beach,2140,89,2229,0.9600718,0.03992822,1,0
PL Ayakkabi,652,31,683,0.9546120,0.04538799,1,0
Kadin B,198897,11302,210199,0.9462319,0.05376810,1,0
Cilt Bakim,29139,1748,30887,0.9434066,0.05659339,1,0
PL Party & Wedding,3536,247,3783,0.9347079,0.06529210,1,0
Vücut Bakim,3705,303,4008,0.9244012,0.07559880,1,0
Ev Giyim,28787,2436,31223,0.9219806,0.07801941,1,0
Anne & Bebek Bakim,4968,422,5390,0.9217069,0.07829314,1,0


In [76]:
action_count_by_gender_fbusinessunit = train[businessunit %in% businessunit_by_gender_long[female_bu==1]$businessunit,.N,c("unique_id")]

In [77]:
setnames(action_count_by_gender_fbusinessunit, "N", "female_businessunit_action_count")

In [78]:
head(action_count_by_gender_fbusinessunit)

unique_id,female_businessunit_action_count
425,177
3273,22
183,517
1983,711
737,1539
4892,19


In [79]:
action_count_by_gender_mbusinessunit = train[businessunit %in% businessunit_by_gender_long[male_bu==1]$businessunit,.N,c("unique_id")]

In [80]:
setnames(action_count_by_gender_mbusinessunit, "N", "male_businessunit_action_count")

In [81]:
head(action_count_by_gender_mbusinessunit)

unique_id,male_businessunit_action_count
183,25
1983,54
737,25
4892,1
1228,63
1572,12


In [82]:
# calculate the total number of actions on contents by each gender
# same logic above is applied
content_by_gender = train[,.N,c("contentid","gender")][order(contentid)]

In [83]:
content_by_gender_long = dcast(content_by_gender, contentid~gender, value.var='N')

In [84]:
setnafill(content_by_gender_long, type=c("const"), fill=0, cols=c("F", "M"))
content_by_gender_long[,total:=F+M]
content_by_gender_long = content_by_gender_long[,total:=F+M]
content_by_gender_long = content_by_gender_long[total>100,]

In [85]:
content_by_gender_long[ ,F_percent := F/total]
content_by_gender_long[ ,M_percent := M/total]
content_by_gender_long[, female_content:=ifelse(F_percent>=0.85,1,0)]
content_by_gender_long[, male_content:=ifelse(M_percent>=0.55,1,0)]

In [105]:
content_by_gender_long[M_percent<=0.65, ][order(-M_percent)]

contentid,F,M,total,F_percent,M_percent,female_content,male_content
32094871,57,104,161,0.3540373,0.6459627,0,1
32967918,65,117,182,0.3571429,0.6428571,0,1
50219388,40,70,110,0.3636364,0.6363636,0,1
31674772,47,82,129,0.3643411,0.6356589,0,1
32282885,42,70,112,0.3750000,0.6250000,0,1
2058203,53,88,141,0.3758865,0.6241135,0,1
34136363,49,81,130,0.3769231,0.6230769,0,1
32652723,73,108,181,0.4033149,0.5966851,0,1
32612666,98,144,242,0.4049587,0.5950413,0,1
51500581,119,173,292,0.4075342,0.5924658,0,1


In [86]:
action_count_by_gender_fcontent = train[contentid %in% content_by_gender_long[female_content==1]$contentid,.N,c("unique_id")]

In [87]:
setnames(action_count_by_gender_fcontent, "N", "female_content_action_count")

In [88]:
head(action_count_by_gender_fcontent)

unique_id,female_content_action_count
425,21
3273,1
183,6
1983,102
737,146
1228,255


In [90]:
action_count_by_gender_mcontent = train[contentid %in% content_by_gender_long[male_content==1]$contentid,.N,c("unique_id")]

In [91]:
setnames(action_count_by_gender_mcontent, "N", "male_content_action_count")

In [93]:
head(action_count_by_gender_mcontent)

unique_id,male_content_action_count
1983,13
1228,14
863,3
3056,2
1506,1
303,1


#### 9. Session Time

In [3]:
train[,exact_timestamp:=strptime(time_stamp, "%Y-%m-%dT%H:%M:%OSZ")]



In [4]:
train[,exact_date:=date(exact_timestamp)]

In [5]:
train[,exact_time:=format(as.POSIXct(exact_timestamp), format = "%H:%M:%S")]

In [6]:
# make sure the data is ordered
train = train[order(exact_timestamp)]

In [7]:
# calculate the last_action_time by shifting the previous action created by each customer
train[,last_action_time:=shift(exact_timestamp, type="lag", 1), by=c("exact_date", "unique_id")]

In [27]:
train[unique_id==425]

time_stamp,contentid,user_action,sellingprice,product_name,brand_id,brand_name,businessunit,product_gender,category_id,...,Level3_Category_Name,gender,unique_id,type,exact_timestamp,exact_date,exact_time,last_action_time,session_time,is_session
2020-10-14T12:33:25Z,47917498,visit,888.75,Jm Gotha/s 5rl 50 Kc Jimmy Choo Günes Gözlügü,2885,Jimmy Choo,Gözlük A,Kadin,379,...,Günes Gözlügü,F,425,train,2020-10-14 12:33:25,2020-10-14,12:33:25,,NA secs,
2020-10-15T20:47:36Z,43555470,search,79.99,Ekru Balon Kol Detayli Triko Kazak TWOAW21KZ0883,40,TRENDYOLMILLA,PL Woman,Kadin,599,...,Kazak & Hirka,F,425,train,2020-10-15 20:47:36,2020-10-15,20:47:36,,NA secs,
2020-10-15T20:47:44Z,43555469,search,79.99,Siyah Balon Kol Detayli Triko Kazak TWOAW21KZ0883,40,TRENDYOLMILLA,PL Woman,Kadin,599,...,Kazak & Hirka,F,425,train,2020-10-15 20:47:44,2020-10-15,20:47:44,2020-10-15 20:47:36,8.000 secs,1
2020-10-15T21:11:41.565Z,51484879,favorite,182.47,Kadin Siyah Babet H726810504,2179,Fox Shoes,Branded Shoes B,Kadin,410,...,Babet,F,425,train,2020-10-15 21:11:41,2020-10-15,21:11:41,2020-10-15 20:47:44,1437.565 secs,1
2020-10-15T21:11:43Z,51484879,visit,159.98,Kadin Siyah Babet H726810504,2179,Fox Shoes,Branded Shoes B,Kadin,410,...,Babet,F,425,train,2020-10-15 21:11:43,2020-10-15,21:11:43,2020-10-15 21:11:41,1.435 secs,1
2020-10-15T21:13:14Z,31638952,visit,319.98,Kadin Vizon Bot G572442002,2179,Fox Shoes,Branded Shoes B,Kadin,407,...,Bot & Bootie,F,425,train,2020-10-15 21:13:14,2020-10-15,21:13:14,2020-10-15 21:11:43,91.000 secs,1
2020-10-16T15:40:47Z,48624027,visit,47.20,Pembe Organik Antibakteriyel Maske - Royal Family,997441,Lily Armor,Saglik,Unisex,4024,...,Nano&Yikanabilir Maske,F,425,train,2020-10-16 15:40:47,2020-10-16,15:40:47,,NA secs,
2020-10-16T15:41:13Z,31891630,visit,68.80,Kids Çignenebilir 60 Tablet,20719,Redoxon,Saglik,Unisex,2322,...,Gida Takviyesi & Vitamin,F,425,train,2020-10-16 15:41:13,2020-10-16,15:41:13,2020-10-16 15:40:47,26.000 secs,1
2020-10-16T15:42:14Z,6886350,visit,39.24,Niloya Multivitamin 60 Cignenn Jelibon Skt:12/20,20741,Voonka,Saglik,,2322,...,Gida Takviyesi & Vitamin,F,425,train,2020-10-16 15:42:14,2020-10-16,15:42:14,2020-10-16 15:41:13,61.000 secs,1
2020-10-17T20:09:09Z,6669602,search,99.99,Kadin Beyaz Çizgili Elbise 9YAK86095IK,842,Koton,Kadin A,Kadin,1182,...,Elbise,F,425,train,2020-10-17 20:09:09,2020-10-17,20:09:09,,NA secs,


In [8]:
# calculate the difference between two actions
train[,session_time:=exact_timestamp-last_action_time]

In [9]:
# if this difference is less than 30min this is selected as "session"
train[,is_session:=ifelse(session_time<=1800, 1, 0)]

In [10]:
avg_time_spent = train[is_session==1, list(avg_session_time=mean(session_time)), by=c("unique_id")]

In [36]:
# for example, below we can see that males spends higher time than females on average
train[is_session==1,list(avg_session_time=mean(session_time)), by=c("gender")]

gender,avg_session_time
F,55.61345 secs
M,63.28396 secs


### Train Data Preparation

Based on the features calculated above, create the train data, fill NAs with 0 and save it.

In [None]:
fav_brand_by_action[,brand_gender:=as.factor(brand_gender)]
fav_cat_by_action[,cat_gender:=as.factor(cat_gender)]
fav_product_gender_by_action[,fav_product_gender:=as.factor(fav_product_gender)]

actions_by_wDay_long[,c(2:6)] = lapply(actions_by_wDay_long[,c(2:6)] , factor)
actions_by_daytime_long[,c(2:6)] = lapply(actions_by_daytime_long[,c(2:6)] , factor)

In [46]:
cl_dat_train = merge(response, fav_brand_by_action, by="unique_id", all.x=T)
cl_dat_train = merge(cl_dat_train, fav_cat_by_action, by="unique_id", all.x=T)
cl_dat_train = merge(cl_dat_train, fav_product_gender_by_action, by="unique_id", all.x=T)
cl_dat_train = merge(cl_dat_train, actions_by_wDay_long, by="unique_id", all.x=T)
cl_dat_train = merge(cl_dat_train, actions_by_daytime_long, by="unique_id", all.x=T)

In [83]:
cl_dat_train = cl_dat_train[,-c("fav_brand", "fav_category")]

In [279]:
cl_dat_train = merge(cl_dat_train, avg_price_by_fav, by="unique_id", all.x=T)
cl_dat_train = merge(cl_dat_train, avg_price_by_order, by="unique_id", all.x=T)
cl_dat_train = merge(cl_dat_train, avg_price_by_basket, by="unique_id", all.x=T)

In [280]:
cl_dat_train[is.na(fav_avg_price), fav_avg_price := mean(train$sellingprice, na.rm=T)]
cl_dat_train[is.na(order_avg_price), order_avg_price := mean(train$sellingprice, na.rm=T)]
cl_dat_train[is.na(basket_avg_price), basket_avg_price := mean(train$sellingprice, na.rm=T)]

In [281]:
cl_dat_train = merge(cl_dat_train, action_counts_long, by="unique_id", all.x=T)

In [282]:
cl_dat_train[is.na(count_basket), count_basket := 0]
cl_dat_train[is.na(count_favorite), count_favorite := 0]
cl_dat_train[is.na(count_order), count_order := 0]
cl_dat_train[is.na(count_search), count_search := 0]
cl_dat_train[is.na(count_visit), count_visit := 0]

In [283]:
cl_dat_train = merge(cl_dat_train, action_count_by_gender_category, by="unique_id", all.x=T)

In [284]:
cl_dat_train[is.na(female_category_action_count), female_category_action_count := 0]

In [13]:
cl_dat_train = merge(cl_dat_train, action_count_by_gender_fbrand, by="unique_id", all.x=T)
cl_dat_train = merge(cl_dat_train, action_count_by_gender_mbrand, by="unique_id", all.x=T)

In [14]:
cl_dat_train[is.na(female_brand_action_count), female_brand_action_count := 0]
cl_dat_train[is.na(male_brand_action_count), male_brand_action_count := 0]

In [45]:
cl_dat_train = merge(cl_dat_train, avg_price_by_unique_id, by="unique_id", all.x=T)

In [46]:
cl_dat_train[is.na(avg_price), avg_price := 0]

In [158]:
cl_dat_train = merge(cl_dat_train, action_count_by_gender_fbusinessunit, by="unique_id", all.x=T)
cl_dat_train = merge(cl_dat_train, action_count_by_gender_mbusinessunit, by="unique_id", all.x=T)

In [39]:
cl_dat_train = merge(cl_dat_train, action_count_by_gender_fbusinessunit, by="unique_id", all.x=T)
cl_dat_train = merge(cl_dat_train, action_count_by_gender_mbusinessunit, by="unique_id", all.x=T)

In [40]:
cl_dat_train[is.na(female_businessunit_action_count), female_businessunit_action_count := 0]
cl_dat_train[is.na(male_businessunit_action_count), male_businessunit_action_count := 0]

In [41]:
cl_dat_train = merge(cl_dat_train, action_count_by_gender_fcontent, by="unique_id", all.x=T)
cl_dat_train = merge(cl_dat_train, action_count_by_gender_mcontent, by="unique_id", all.x=T)

In [42]:
cl_dat_train[is.na(female_content_action_count), female_content_action_count := 0]
cl_dat_train[is.na(male_content_action_count), male_content_action_count := 0]

In [101]:
cl_dat_train = merge(cl_dat_train, order_count_by_gender_category, by="unique_id", all.x=T)
cl_dat_train = merge(cl_dat_train, order_count_by_gender_fbrand, by="unique_id", all.x=T)
cl_dat_train = merge(cl_dat_train, order_count_by_gender_mbrand, by="unique_id", all.x=T)
cl_dat_train = merge(cl_dat_train, order_count_by_gender_fbusinessunit, by="unique_id", all.x=T)
cl_dat_train = merge(cl_dat_train, order_count_by_gender_mbusinessunit, by="unique_id", all.x=T)
cl_dat_train = merge(cl_dat_train, order_count_by_gender_fcontent, by="unique_id", all.x=T)
cl_dat_train = merge(cl_dat_train, order_count_by_gender_mcontent, by="unique_id", all.x=T)

In [102]:
cl_dat_train[is.na(female_category_order_count), female_category_order_count := 0]
cl_dat_train[is.na(female_brand_order_count), female_brand_order_count := 0]
cl_dat_train[is.na(male_brand_order_count), male_brand_order_count := 0]
cl_dat_train[is.na(female_businessunit_order_count), female_businessunit_order_count := 0]
cl_dat_train[is.na(male_businessunit_order_count), male_businessunit_order_count := 0]
cl_dat_train[is.na(female_content_order_count), female_content_order_count := 0]
cl_dat_train[is.na(male_content_order_count), male_content_order_count := 0]

In [31]:
cl_dat_train = merge(cl_dat_train, avg_time_spent, by="unique_id", all.x=T)

In [32]:
cl_dat_train[is.na(avg_session_time), avg_session_time := 0]

In [18]:
head(cl_dat_train)

unique_id,gender,brand_gender,cat_gender,fav_product_gender,wDay_basket,wDay_favorite,wDay_order,wDay_search,wDay_visit,...,count_visit,female_category_action_count,female_brand_action_count,male_brand_action_count,avg_price,female_businessunit_action_count,male_businessunit_action_count,female_content_action_count,male_content_action_count,avg_session_time
1,F,Kadin,Unisex,Unisex,weekday,weekday,weekday,weekend,weekday,...,745,1147,660,0,166.3069,906,2,190,0,67.6581
2,F,Kadin,Unisex,Kadin,weekday,weekday,weekday,weekend,weekday,...,781,2306,1795,0,154.1907,2239,6,144,1,45.94008
3,F,Kadin,Kadin,Kadin,weekend,Unknown,weekend,weekend,weekday,...,135,0,279,0,271.3165,283,0,10,0,38.43928
4,F,Kadin,Kadin,Kadin,weekday,weekday,weekday,weekday,weekday,...,2878,4232,2784,1,289.8914,4084,127,84,0,71.24705
5,F,Kadin,Kadin,Kadin,weekday,weekday,weekend,weekday,weekday,...,3297,3688,2686,0,137.4751,3619,0,419,0,45.401
6,M,Kadin,Kadin,Unisex,weekday,weekday,weekday,weekday,weekday,...,3392,3684,1442,0,189.8601,3763,220,245,18,39.07162


In [None]:
fwrite(cl_dat_train, "model-data/train.csv", col.names=T, sep=",")

### Test Data Preperation

Apply the same calculation logic for features used in train data, create the features, fill NAs with 0 and save it.

In [48]:
fav_actions_test = test[,.N,c("unique_id","user_action","brand_id")][order(user_action, -N)][, head(.SD, 1), by = c("unique_id","user_action")]

In [49]:
fav_brand_by_action_test = fav_actions_test[,.N,c("unique_id","brand_id")][order(-N)][, head(.SD, 1), by = c("unique_id")]

In [50]:
fav_brand_by_action_test = fav_brand_by_action_test[,1:2]
setnames(fav_brand_by_action_test, "brand_id", "fav_brand")

In [51]:
fav_brand_by_action_test = merge(fav_brand_by_action_test, brand_gender, by.x="fav_brand", by.y="brand_id", all.x=T)
fav_brand_by_action_test[,brand_gender:=ifelse(brand_gender==''|is.na(fav_brand_by_action_test$brand_gender), "Unisex", brand_gender)]

In [52]:
test$concat_Category_Id = paste0(test$Level1_Category_Id, test$Level2_Category_Id)# test$Level3_Category_Id)

In [53]:
fav_cat_test = test[,.N,c("unique_id","user_action","concat_Category_Id")][order(user_action, -N)][, head(.SD, 1), by = c("unique_id","user_action")]

In [54]:
fav_cat_by_action_test = fav_cat_test[,.N,c("unique_id","concat_Category_Id")][order(-N)][, head(.SD, 1), by = c("unique_id")]

In [55]:
fav_cat_by_action_test = fav_cat_by_action_test[,1:2]
setnames(fav_cat_by_action_test, "concat_Category_Id", "fav_category")

In [56]:
fav_cat_by_action_test = merge(fav_cat_by_action_test, cat_gender, by.x="fav_category", by.y="concat_Category_Id", all.x=T)
fav_cat_by_action_test[,cat_gender:=ifelse(cat_gender==''|is.na(fav_cat_by_action_test$cat_gender), "Unisex", cat_gender)]

In [57]:
fav_product_gender_test = test[,.N,c("unique_id","user_action","product_gender")][order(user_action, -N)][, head(.SD, 1), by = c("unique_id","user_action")]

In [58]:
fav_product_gender_by_action_test = fav_product_gender_test[,.N,c("unique_id","product_gender")][order(-N)][, head(.SD, 1), by = c("unique_id")]

In [59]:
fav_product_gender_by_action_test = fav_product_gender_by_action_test[,1:2]
setnames(fav_product_gender_by_action_test, "product_gender", "fav_product_gender")

In [60]:
fav_product_gender_by_action_test[,fav_product_gender:=ifelse(fav_product_gender==''|is.na(fav_product_gender_by_action_test$fav_product_gender), "Unisex", fav_product_gender)]

In [61]:
test$date = as.Date(test$time_stamp)
weekdays1 = c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday')
test$wDay = c('weekend', 'weekday')[(weekdays(test$date) %in% weekdays1)+1L]

In [62]:
actions_by_wDay_test = test[,.N, c("unique_id","user_action","wDay")][order(user_action, -N)][, head(.SD, 1), by = c("unique_id","user_action")]

In [63]:
actions_by_wDay_long_test = dcast(actions_by_wDay_test, unique_id~user_action, value.var='wDay')

In [64]:
colnames(actions_by_wDay_long_test)[2:ncol(actions_by_wDay_long_test)] = paste("wDay", colnames(actions_by_wDay_long_test)[2:ncol(actions_by_wDay_long_test)], sep = "_")

In [65]:
actions_by_wDay_long_test[, 2:6][is.na(actions_by_wDay_long_test[, 2:6])] = "Unknown"

In [66]:
test$timestamp = as.POSIXct(test$time_stamp, format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC")

In [67]:
test$daytime = nightday(test$timestamp)

In [68]:
actions_by_daytime_test = test[,.N, c("unique_id","user_action","daytime")][order(user_action, -N)][, head(.SD, 1), by = c("unique_id","user_action")]

In [69]:
actions_by_daytime_long_test = dcast(actions_by_daytime_test, unique_id~user_action, value.var='daytime')

In [70]:
colnames(actions_by_daytime_long_test)[2:ncol(actions_by_daytime_long_test)] = paste("daytime", colnames(actions_by_daytime_long_test)[2:ncol(actions_by_daytime_long_test)], sep = "_")

In [71]:
actions_by_daytime_long_test[, 2:6][is.na(actions_by_daytime_long_test[, 2:6])] = "Unknown"

In [72]:
fav_brand_by_action_test[,brand_gender:=as.factor(brand_gender)]
fav_cat_by_action_test[,cat_gender:=as.factor(cat_gender)]
fav_product_gender_by_action_test[,fav_product_gender:=as.factor(fav_product_gender)]

actions_by_wDay_long_test[,c(2:6)] = lapply(actions_by_wDay_long_test[,c(2:6)] , factor)
actions_by_daytime_long_test[,c(2:6)] = lapply(actions_by_daytime_long_test[,c(2:6)] , factor)

In [75]:
cl_dat_test = merge(test_id, fav_brand_by_action_test, by="unique_id", all=T)
cl_dat_test = merge(cl_dat_test, fav_cat_by_action_test, by="unique_id", all=T)
cl_dat_test = merge(cl_dat_test, fav_product_gender_by_action_test, by="unique_id", all=T)
cl_dat_test = merge(cl_dat_test, actions_by_wDay_long_test, by="unique_id", all=T)
cl_dat_test = merge(cl_dat_test, actions_by_daytime_long_test, by="unique_id", all=T)

In [85]:
cl_dat_test = cl_dat_test[, -c("fav_brand", "fav_category")]

In [332]:
avg_price_by_fav_test = test[user_action=="favorite", list(fav_avg_price=mean(sellingprice)) ,"unique_id"]

In [333]:
avg_price_by_order_test = test[user_action=="order", list(order_avg_price=mean(sellingprice)) ,"unique_id"]

In [334]:
avg_price_by_basket_test = test[user_action=="basket", list(basket_avg_price=mean(sellingprice)),"unique_id"]

In [335]:
action_counts_test = test[,.N,c("unique_id","user_action")][order(user_action, -N)]

In [336]:
action_counts_long_test = dcast(action_counts_test, unique_id~user_action, value.var='N')

In [337]:
colnames(action_counts_long_test)[2:ncol(action_counts_long_test)] = paste("count", colnames(action_counts_long_test)[2:ncol(action_counts_long_test)], sep = "_")

In [339]:
action_count_by_gender_category_test = test[Level3_Category_Name %in% action_by_gender_long[female_categories==1]$Level3_Category_Name,.N,c("unique_id")]

In [340]:
setnames(action_count_by_gender_category_test, "N", "female_category_action_count")

In [359]:
cl_dat_test = merge(cl_dat_test, avg_price_by_fav_test, by="unique_id", all.x=T)
cl_dat_test = merge(cl_dat_test, avg_price_by_order_test, by="unique_id", all.x=T)
cl_dat_test = merge(cl_dat_test, avg_price_by_basket_test, by="unique_id", all.x=T)

In [360]:
cl_dat_test[is.na(fav_avg_price), fav_avg_price := mean(test$sellingprice, na.rm=T)]
cl_dat_test[is.na(order_avg_price), order_avg_price := mean(test$sellingprice, na.rm=T)]
cl_dat_test[is.na(basket_avg_price), basket_avg_price := mean(test$sellingprice, na.rm=T)]

In [361]:
cl_dat_test = merge(cl_dat_test, action_counts_long_test, by="unique_id", all.x=T)

In [362]:
cl_dat_test[is.na(count_basket), count_basket := 0]
cl_dat_test[is.na(count_favorite), count_favorite := 0]
cl_dat_test[is.na(count_order), count_order := 0]
cl_dat_test[is.na(count_search), count_search := 0]
cl_dat_test[is.na(count_visit), count_visit := 0]

In [363]:
cl_dat_test = merge(cl_dat_test, action_count_by_gender_category_test, by="unique_id", all.x=T)

In [364]:
cl_dat_test[is.na(female_category_action_count), female_category_action_count := 0]

In [18]:
action_count_by_gender_fcategory_test = test[brand_name %in% brand_by_gender_long[female_brands==1]$brand_name,.N,c("unique_id")]

In [19]:
setnames(action_count_by_gender_fcategory_test, "N", "female_brand_action_count")

In [21]:
action_count_by_gender_mcategory_test = test[brand_name %in% brand_by_gender_long[male_brands==1]$brand_name,.N,c("unique_id")]

In [22]:
setnames(action_count_by_gender_mcategory_test, "N", "male_brand_action_count")

In [24]:
cl_dat_test = merge(cl_dat_test, action_count_by_gender_fcategory_test, by="unique_id", all.x=T)
cl_dat_test = merge(cl_dat_test, action_count_by_gender_mcategory_test, by="unique_id", all.x=T)

In [25]:
cl_dat_test[is.na(female_brand_action_count), female_brand_action_count := 0]
cl_dat_test[is.na(male_brand_action_count), male_brand_action_count := 0]

In [164]:
avg_price_by_unique_id_test = train[,list(avg_price=mean(sellingprice, na.rm=T)), "unique_id"]

In [176]:
cl_dat_test = merge(cl_dat_test, avg_price_by_unique_id_test, by="unique_id", all.x=T)

In [177]:
cl_dat_test[is.na(avg_price), avg_price := 0]

In [46]:
cl_dat_test[,female_businessunit_action_count:=NULL]
cl_dat_test[,male_businessunit_action_count:=NULL]

In [47]:
action_count_by_gender_fbusinessunit_test = test[businessunit %in% businessunit_by_gender_long[female_bu==1]$businessunit,.N,c("unique_id")]

In [48]:
setnames(action_count_by_gender_fbusinessunit_test, "N", "female_businessunit_action_count")

In [49]:
action_count_by_gender_mbusinessunit_test = test[businessunit %in% businessunit_by_gender_long[male_bu==1]$businessunit,.N,c("unique_id")]

In [50]:
setnames(action_count_by_gender_mbusinessunit_test, "N", "male_businessunit_action_count")

In [51]:
cl_dat_test = merge(cl_dat_test, action_count_by_gender_fbusinessunit_test, by="unique_id", all.x=T)
cl_dat_test = merge(cl_dat_test, action_count_by_gender_mbusinessunit_test, by="unique_id", all.x=T)

In [52]:
cl_dat_test[is.na(female_businessunit_action_count), female_businessunit_action_count := 0]
cl_dat_test[is.na(male_businessunit_action_count), male_businessunit_action_count := 0]

In [54]:
action_count_by_gender_fcontent_test = test[contentid %in% content_by_gender_long[female_content==1]$contentid,.N,c("unique_id")]

In [55]:
setnames(action_count_by_gender_fcontent_test, "N", "female_content_action_count")

In [56]:
action_count_by_gender_mcontent_test = test[contentid %in% content_by_gender_long[male_content==1]$contentid,.N,c("unique_id")]

In [57]:
setnames(action_count_by_gender_mcontent_test, "N", "male_content_action_count")

In [58]:
cl_dat_test = merge(cl_dat_test, action_count_by_gender_fcontent_test, by="unique_id", all.x=T)
cl_dat_test = merge(cl_dat_test, action_count_by_gender_mcontent_test, by="unique_id", all.x=T)

In [59]:
cl_dat_test[is.na(female_content_action_count), female_content_action_count := 0]
cl_dat_test[is.na(male_content_action_count), male_content_action_count := 0]

In [19]:
test[,exact_timestamp:=strptime(time_stamp, "%Y-%m-%dT%H:%M:%OSZ")]



In [20]:
test[,exact_date:=date(exact_timestamp)]

In [21]:
test[,exact_time:=format(as.POSIXct(exact_timestamp), format = "%H:%M:%S")]

In [22]:
test = test[order(exact_timestamp)]

In [23]:
test[,last_action_time:=shift(exact_timestamp, type="lag", 1), by=c("exact_date", "unique_id")]

In [24]:
test[,session_time:=exact_timestamp-last_action_time]

In [25]:
test[,is_session:=ifelse(session_time<=1800, 1, 0)]

In [26]:
avg_time_spent_test = test[is_session==1,list(avg_session_time=mean(session_time)), by=c("unique_id")]

In [27]:
cl_dat_test = merge(cl_dat_test, avg_time_spent_test, by="unique_id", all.x=T)

In [28]:
cl_dat_test[is.na(avg_session_time), avg_session_time := 0]

In [39]:
head(cl_dat_test)

unique_id,brand_gender,cat_gender,fav_product_gender,wDay_basket,wDay_favorite,wDay_order,wDay_search,wDay_visit,daytime_basket,...,count_visit,female_category_action_count,female_brand_action_count,male_brand_action_count,avg_price,female_businessunit_action_count,male_businessunit_action_count,female_content_action_count,male_content_action_count,avg_session_time
9,Kadin,Kadin,Kadin,weekday,weekday,weekday,weekday,weekday,night,...,689,1000,958,4,0,1277,20,168,2,65.38195 secs
18,Kadin,Kadin,Kadin,weekday,weekday,weekday,weekday,weekday,night,...,6300,8006,5845,0,0,7721,88,915,4,59.00120 secs
21,Kadin,Kadin,Kadin,weekday,weekday,weekday,weekday,weekday,night,...,1174,1594,970,0,0,1437,14,111,0,43.51772 secs
25,Unisex,Unisex,Kadin,weekday,weekend,weekday,weekend,weekend,afternoon,...,647,938,523,0,0,856,7,48,0,38.79783 secs
31,Kadin,Unisex,Kadin,weekday,weekday,weekday,weekday,weekday,night,...,5405,6213,3701,0,0,5769,40,664,1,22.14794 secs
32,Kadin,Kadin,Kadin,weekday,weekday,weekday,weekday,weekday,night,...,836,1078,770,0,0,824,0,138,0,27.35902 secs


In [None]:
fwrite(cl_dat_test, "model-data/test.csv", col.names=T, sep=",")