## 1. Accessing the AcquireShoppers Database

Before we can start our data exploration, we need to connect to the *AcquireShoppers* database from the *R* command line. First, install the *RMySQL* package and load it into the *R* workspace. Once that is done, we can set up a connection to the MySQL server (here, hosted locally on your laptop).

In [96]:
library(RMySQL)
con <- dbConnect(MySQL(), user = 'root', password = 'root', host = 'localhost', dbname = 'AcquireShoppers')

The database contains a smaller, down-sampled version of the *transactions*, *trainHistory* and *testHistory* tables. These tables are small enough that we can join them and still hold the results in memory.

In order to facilitate fast analysis, we wish to index some of the critical fields of the transactions table. This table is by far the largest table, so indexing the right columns will significantly speed up future analysis. The competition states that the *transactions* table can be joined to *trainHistory* and *testHistory* using *(id, chain)*. We thus choose to index these fields for the tables *transactions*, *trainHistory* and *testHistory*.

In addition, the Kaggle benchmarks use the *brand*, *company* and *category* fields, so these fields seem important and we choose to index them as well. **Note that the indexing commands take a long time to run.**
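The indexing commands themselves are not reproduced in this notebook. As a rough sketch only, they might look like the following, assuming the indexes are created with `DBI::dbExecute()` on the open connection `con`. The helper `make_index_sql` and the index names are hypothetical, and the column names are taken from the description above.

```r
# Hypothetical helper: build a CREATE INDEX statement for one table.
make_index_sql <- function(table, cols) {
  sprintf("CREATE INDEX idx_%s_%s ON %s (%s)",
          table, paste(cols, collapse = "_"),
          table, paste(cols, collapse = ", "))
}

index_stmts <- c(
  # Join keys named by the competition: (id, chain) on all three tables.
  make_index_sql("transactions", c("id", "chain")),
  make_index_sql("trainHistory", c("id", "chain")),
  make_index_sql("testHistory", c("id", "chain")),
  # Fields used by the Kaggle benchmarks.
  make_index_sql("transactions", "brand"),
  make_index_sql("transactions", "company"),
  make_index_sql("transactions", "category")
)
print(index_stmts)

# With a live connection, each statement could then be run with, e.g.:
# for (stmt in index_stmts) DBI::dbExecute(con, stmt)
```

Building the statements as strings first makes it easy to inspect them before committing to the slow index build on the full *transactions* table.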

We can check if we have successfully established a connection to the *AcquireShoppers* database by loading the smallest table: *offers*.

In [97]:
offers <- dbReadTable(conn = con, name = 'offers')
head(offers)

offer,category,quantity,company,offervalue,brand
1190530,9115,1,108500080,5.0,93904
1194044,9909,1,107127979,1.0,6732
1197502,3203,1,106414464,0.75,13474
1198271,5558,1,107120272,1.5,5072
1198272,5558,1,107120272,1.5,5072
1198273,5558,1,107120272,1.5,5072


Success! As this analysis is exploratory, we can choose to focus on the sampled versions of the tables created within the *databaseSetup.ipynb* script: *testHistorySmall*, *trainHistorySmall* and *transactionsSmall*. These down-sampled files are only ~2GB and so should fit within the memory of most modern laptops.

To get the best possible score on the Kaggle challenge, we will later need to use the full dataset, but for now let's focus on the down-sampled version. We start by loading the down-sampled tables into memory.

In [98]:
testHistory <- dbReadTable(conn = con, name = 'testHistorySmall')
testHistory$offerdate <- as.Date(testHistory$offerdate)
print('testHistory table')
head(testHistory)
trainHistory <- dbReadTable(conn = con, name = 'trainHistorySmall')
trainHistory$offerdate <- as.Date(trainHistory$offerdate)
print('trainHistory table')
head(trainHistory)
transactions <- dbReadTable(conn = con, name = 'transactionsSmall')
transactions$date <- as.Date(transactions$date)
print('transactions table')
head(transactions)

[1] "testHistory table"


Id,chain,offer,market,offerdate
12524696,4,1221665,1,2013-06-20
15417308,95,1213242,39,2013-05-22
19166969,18,1221658,11,2013-06-20
51472749,17,1221663,4,2013-06-20
52015552,18,1221658,11,2013-06-22
52584180,18,1221663,11,2013-06-22


[1] "trainHistory table"


Id,chain,offer,market,repeattrips,repeater,offerdate
50622160,18,1197502,11,0,f,2013-03-31
50808053,3,1197502,2,0,f,2013-04-02
51150009,4,1197502,1,0,f,2013-03-25
59067761,95,1204822,39,0,f,2013-04-19
66507967,88,1197502,14,0,f,2013-03-25
70217582,15,1204822,9,0,f,2013-04-23


[1] "transactions table"


Id,chain,dept,category,company,brand,date,productsize,productmeasure,purchasequantity,purchaseamount
50622160,18,26,2622,102113020,15704,2012-03-08,40,CT,1,3.19
50622160,18,2,202,101600010,1756,2012-03-08,18,OZ,2,4.5
50622160,18,26,2622,102113020,15704,2012-03-08,30,CT,1,3.19
50622160,18,27,2704,107910070,30096,2012-03-08,24,OZ,1,4.99
50622160,18,25,2506,102113020,10786,2012-03-08,32,OZ,1,2.3
50622160,18,37,3708,1076026373,1537,2012-03-08,12,OZ,1,4.49


In *databaseSetup.ipynb*, we saw that the tables *offers*, *testHistory* and *trainHistory* have the fields *offer*, *id* and *id* respectively as their primary keys. Furthermore, *offer* serves as a foreign key in both *trainHistory* and *testHistory*, whilst *id* (now shown as *Id*) is a foreign key within *transactions*. The diagram below summarizes these relationships:
<img src="files/databases.png">
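
To make the key relationships concrete, here is a small sketch of how the tables could be joined in memory with base R's `merge()`. The toy data frames and their values are made up for illustration; only the column names follow the real tables.

```r
# Toy stand-ins for the real tables (values are made up for illustration).
offers_toy <- data.frame(offer = c(1, 2), offervalue = c(1.5, 0.75))
history_toy <- data.frame(Id = c(10, 11, 12),
                          chain = c(4, 4, 18),
                          offer = c(1, 2, 1))
transactions_toy <- data.frame(Id = c(10, 10, 12),
                               chain = c(4, 4, 18),
                               purchaseamount = c(3.19, 4.50, 2.30))

# offer is a foreign key in the history table: attach the offer details.
history_offers <- merge(history_toy, offers_toy, by = "offer")

# (Id, chain) links each shopper's history row to their transactions.
shopper_tx <- merge(history_toy, transactions_toy, by = c("Id", "chain"))

print(history_offers)
print(shopper_tx)
```

By default `merge()` performs an inner join, so a shopper with no transactions in the sample drops out of `shopper_tx`; passing `all.x = TRUE` would keep such rows.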

Apart from these keys, we need to get an understanding of what the other fields mean. We undertake this via data exploration.

## 2. Understanding each table

We start by checking the number of NaNs and NAs within each table:

In [26]:
print("Number of NaNs within offers: ")
print(sum(sapply(1:ncol(offers), function(x) sum(is.nan(offers[1:nrow(offers), x])))))
print("Number of NaNs within testHistory: ")
print(sum(sapply(1:ncol(testHistory), function(x) sum(is.nan(testHistory[1:nrow(testHistory), x])))))
print("Number of NaNs within trainHistory: ")
print(sum(sapply(1:ncol(trainHistory), function(x) sum(is.nan(trainHistory[1:nrow(trainHistory), x])))))
print("Number of NaNs within transactions: ")
print(sum(sapply(1:ncol(transactions), function(x) sum(is.nan(transactions[1:nrow(transactions), x])))))

[1] "Number of NaNs within offers: "
[1] 0
[1] "Number of NaNs within testHistory: "
[1] 0
[1] "Number of NaNs within trainHistory: "
[1] 0
[1] "Number of NaNs within transactions: "
[1] 0


In [27]:
print("Number of NAs within offers: ")
print(sum(sapply(1:ncol(offers), function(x) sum(is.na(offers[1:nrow(offers), x])))))
print("Number of NAs within testHistory: ")
print(sum(sapply(1:ncol(testHistory), function(x) sum(is.na(testHistory[1:nrow(testHistory), x])))))
print("Number of NAs within trainHistory: ")
print(sum(sapply(1:ncol(trainHistory), function(x) sum(is.na(trainHistory[1:nrow(trainHistory), x])))))
print("Number of NAs within transactions: ")
print(sum(sapply(1:ncol(transactions), function(x) sum(is.na(transactions[1:nrow(transactions), x])))))

[1] "Number of NAs within offers: "
[1] 0
[1] "Number of NAs within testHistory: "
[1] 0
[1] "Number of NAs within trainHistory: "
[1] 0
[1] "Number of NAs within transactions: "
[1] 0


Luckily, we find that there are no missing values within any of the tables. Let's examine each table more closely.
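
As an aside, the same counts can be obtained more compactly. The sketch below uses a made-up toy data frame rather than the real tables, and reports the counts per column instead of one total.

```r
# Toy data frame with one NA and one NaN (made-up values).
toy <- data.frame(a = c(1, NA, 3),
                  b = c(0.5, NaN, 2),
                  c = c("x", "y", "z"),
                  stringsAsFactors = FALSE)

# is.na() works on a whole data frame; note that it also flags NaN as missing.
na_per_col <- colSums(is.na(toy))

# is.nan() has no data frame method, so apply it column by column.
nan_per_col <- sapply(toy, function(col) sum(is.nan(col)))

print(na_per_col)
print(nan_per_col)
```

Per-column counts make it easier to see which field is affected when a table does contain missing values.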

### 2.1. The Offers Table

The *offers* table has the fields *category*, *quantity*, *company*, *offervalue* and *brand*. The fields *quantity* and *offervalue* are unique to the *offers* table. To get a feel for these two fields, let's first see how many distinct values each of them takes. **Notice that *category*, *company* and *brand* also appear in the transactions table!**

In [33]:
print("Table of offervalue field")
table(as.factor(offers$offervalue))

[1] "Table of offervalue field"



0.75    1 1.25  1.5    2    3    5 
   1    5    3   19    6    2    1 

In [32]:
print("Table of quantity field")
table(as.factor(offers$quantity))

[1] "Table of quantity field"



 1  2 
36  1 

The *quantity* field is - with one exception - always equal to one. Let's find the row the quantity of 2 corresponds to.

In [35]:
offers[offers$quantity == 2, ]

Unnamed: 0,offer,category,quantity,company,offervalue,brand
32,1221658,7205,2,103700030,3,4294


Does the fact that this is the only coupon that offers 2 of a product make shoppers offered this coupon more likely to buy the product?

In [47]:
print("Proportion of 'Quantity-2' coupon in trainHistory:")
sum(trainHistory$offer == 1221658) / nrow(trainHistory)
print("Proportion of 'Quantity-2' coupon in testHistory:")
sum(testHistory$offer == 1221658) / nrow(testHistory)

[1] "Proportion of 'Quantity-2' coupon in trainHistory:"


[1] "Proportion of 'Quantity-2' coupon in testHistory:"


The 'Quantity-2' coupon is never offered to the Shoppers in the training set but it is offered to 20.9% of Shoppers in the test set! This raises the question of how different the coupon offers are between the training and test set.

In [78]:
print("train     test")
props <- t(sapply(1:nrow(offers), function(x) c(x, sum(trainHistory$offer == offers[x, 1]) / nrow(trainHistory), sum(testHistory$offer == offers[x, 1]) / nrow(testHistory))))
print(props)
print("")
print("Proportion of Train Shoppers offered Coupons that are NEVER offered to Test Shoppers:")
sum(props[props[, 3] == 0, 2])
print("Proportion of Test Shoppers offered Coupons that are NEVER offered to Train Shoppers:")
sum(props[props[, 2] == 0, 3])

[1] "train     test"
      [,1]        [,2]         [,3]
 [1,]    1 0.000000000 0.0132647537
 [2,]    2 0.041025641 0.0000000000
 [3,]    3 0.286153846 0.0000000000
 [4,]    4 0.007051282 0.0005414185
 [5,]    5 0.012179487 0.0012181917
 [6,]    6 0.008076923 0.0002707093
 [7,]    7 0.006410256 0.0018949648
 [8,]    8 0.012948718 0.0008121278
 [9,]    9 0.050256410 0.0000000000
[10,]   10 0.045384615 0.0000000000
[11,]   11 0.019871795 0.0000000000
[12,]   12 0.008076923 0.0000000000
[13,]   13 0.029871795 0.0059556037
[14,]   14 0.009871795 0.0043313481
[15,]   15 0.001025641 0.0009474824
[16,]   16 0.008974359 0.0000000000
[17,]   17 0.090128205 0.0000000000
[18,]   18 0.000000000 0.0851380617
[19,]   19 0.045641026 0.0040606389
[20,]   20 0.014871795 0.0004060639
[21,]   21 0.015000000 0.0001353546
[22,]   22 0.093974359 0.0033838657
[23,]   23 0.020641026 0.0008121278
[24,]   24 0.117179487 0.0028424472
[25,]   25 0.041794872 0.0048727666
[26,]   26 0.013589744 0.0047374120
[27,]  

[1] "Proportion of Test Shoppers offered Coupons that are NEVER offered to Train Shoppers:"


**Virtually none of the Test Shoppers are offered a coupon that is also offered to any of the Shoppers in the Training set!** Hopefully, the categories, companies and brands overlap more between the training and test sets than the coupons themselves do.
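
The overlap calculation can also be written without row indices. The following is a minimal sketch on made-up toy data; `offer_share` is a hypothetical helper, not part of the original analysis.

```r
# Toy stand-ins (values are made up for illustration).
offers_toy <- data.frame(offer = c(101, 102, 103, 104))
train_toy <- data.frame(offer = c(101, 101, 102, 103))
test_toy <- data.frame(offer = c(103, 104, 104, 104))

# Hypothetical helper: share of shoppers in `history` receiving each offer.
# Using a factor with fixed levels keeps offers with zero shoppers in the result.
offer_share <- function(history, offer_ids) {
  as.vector(table(factor(history$offer, levels = offer_ids))) / nrow(history)
}

props_toy <- data.frame(
  offer = offers_toy$offer,
  train = offer_share(train_toy, offers_toy$offer),
  test = offer_share(test_toy, offers_toy$offer)
)
print(props_toy)

# Share of train shoppers whose coupon never appears in the test set, and vice versa.
train_only <- sum(props_toy$train[props_toy$test == 0])
test_only <- sum(props_toy$test[props_toy$train == 0])
print(c(train_only = train_only, test_only = test_only))
```

Keeping the offer id as a named column, rather than a row index, makes the resulting table easier to read back against *offers*.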

In [77]:
length(table(as.factor(offers$category)))
length(table(as.factor(offers$company)))
length(table(as.factor(offers$brand)))

In [79]:
length(table(as.factor(transactions$category)))
length(table(as.factor(transactions$company)))
length(table(as.factor(transactions$brand)))

In [93]:
head(as.Date(transactions$date))

[1] "2012-03-08" "2012-03-08" "2012-03-08" "2012-03-08" "2012-03-08"
[6] "2012-03-08"

In [84]:
# %Y matches a four-digit year; %y would only match a two-digit year
format(as.Date('2012-03-08', format = '%Y-%m-%d'), format = '%d/%m/%y')

In [88]:
# %y expects a two-digit year, so the four-digit '2012' fails to parse and the result is NA
as.Date('2012-03-08', format='%y-%d-%m')

[1] NA

In [89]:
typeof(transactions$date[1])

In [90]:
transactions$date[1]

In [92]:
typeof(as.Date('2012-03-08'))

In [101]:
c(min(transactions$date), max(transactions$date))

[1] "2012-03-02" "2013-07-28"

In [102]:
c(min(trainHistory$offerdate), max(trainHistory$offerdate))

[1] "2013-03-01" "2013-04-30"

In [112]:
c(min(testHistory$offerdate), max(testHistory$offerdate))

[1] "2013-05-01" "2013-07-29"

In [103]:
transactions[transactions$Id==50622160,]

Id,chain,dept,category,company,brand,date,productsize,productmeasure,purchasequantity,purchaseamount
50622160,18,26,2622,102113020,15704,2012-03-08,40,CT,1,3.19
50622160,18,2,202,101600010,1756,2012-03-08,18,OZ,2,4.50
50622160,18,26,2622,102113020,15704,2012-03-08,30,CT,1,3.19
50622160,18,27,2704,107910070,30096,2012-03-08,24,OZ,1,4.99
50622160,18,25,2506,102113020,10786,2012-03-08,32,OZ,1,2.30
50622160,18,37,3708,1076026373,1537,2012-03-08,12,OZ,1,4.49
50622160,18,37,3708,1076026373,1537,2012-03-08,9,OZ,1,0.00
50622160,18,5,519,103000030,14760,2012-03-08,15,OZ,1,3.49
50622160,18,9,907,102113020,15704,2012-03-08,24,OZ,2,4.00
50622160,18,92,9208,102764828,7487,2012-03-15,1,CT,1,3.49


In [111]:
dim(transactions[transactions$date >= as.Date('2013-03-01'), ])

In [108]:
dim(transactions)

In [109]:
transactions[transactions$Id==50808053,]

Unnamed: 0,Id,chain,dept,category,company,brand,date,productsize,productmeasure,purchasequantity,purchaseamount
273,50808053,3,25,2508,102529323,16410,2012-03-04,64,OZ,1,3.69
274,50808053,3,56,5610,102113020,4720,2012-03-04,4,OZ,1,4.29
275,50808053,3,21,2106,103120030,12957,2012-03-04,101,OZ,1,4.99
276,50808053,3,57,5704,107047070,20230,2012-03-04,6,OZ,10,6.00
277,50808053,3,9,912,104812141,18322,2012-03-04,20,OZ,1,2.99
278,50808053,3,56,5615,102113020,10786,2012-03-04,8,OZ,1,2.99
279,50808053,3,31,3101,102113020,15704,2012-03-04,4,OZ,1,4.39
280,50808053,3,9,902,107989373,29344,2012-03-04,8,OZ,1,3.29
281,50808053,3,21,2114,102113020,15704,2012-03-04,15,OZ,1,1.99
282,50808053,3,63,6321,104082242,15667,2012-03-04,10,OZ,1,2.99


In [134]:
# sapply() drops the Date class, so the result is a number of days since 1970-01-01
min(sapply(1:1000, function(x) as.Date(min(transactions$date[transactions$Id == trainHistory$Id[x]]), origin = "1970-01-01")))

In [117]:
head(transactions$date)

[1] "2012-03-08" "2012-03-08" "2012-03-08" "2012-03-08" "2012-03-08"
[6] "2012-03-08"

In [118]:
head(min(transactions$date))

[1] "2012-03-02"

In [132]:
as.Date(15401, origin = "1970-01-01")

[1] "2012-03-02"

In [135]:
nrow(trainHistory)