# Store price prediction


## 1. Loading prerequisites
#### 1.1. Libraries

In [1]:
# Packages used in this notebook
packages = c("dplyr", "ggplot2", "caret", "psych", "mlbench", "AppliedPredictiveModeling")
suppressWarnings({
    for (pkg in packages) {
        # Install the package if it is missing, then attach it
        if (!require(pkg, character.only = TRUE)) install.packages(pkg)
        library(pkg, character.only = TRUE)
    }
})
# Default plot size for the notebook
options(repr.plot.width=6, repr.plot.height=4)
# Semi-transparent lattice theme from AppliedPredictiveModeling
transparentTheme(trans = .4)

Loading required package: dplyr

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Loading required package: ggplot2
Loading required package: caret
Loading required package: lattice
Loading required package: psych

Attaching package: 'psych'

The following objects are masked from 'package:ggplot2':

    %+%, alpha

Loading required package: mlbench
Loading required package: AppliedPredictiveModeling


#### 1.2. Data

In [2]:
DATA_PATH = "./data/"
# All three files are comma-separated with a header row
train_raw = read.table(paste0(DATA_PATH,"Train.csv"), header = TRUE, sep = ",")
test_raw = read.table(paste0(DATA_PATH,"Test.csv"), header = TRUE, sep = ",")
pred_true = read.table(paste0(DATA_PATH,"Sample Submission.csv"), header = TRUE, sep = ",")
# Dimensions (rows, columns) of each table
dim(train_raw); dim(test_raw); dim(pred_true)

## 2. Exploratory data analysis

#### 2.1. Viewing the data

In [3]:
head(train_raw, 10)

InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
6141,1583,144,3,2011-05-06 16:54:00,3.75,14056,35
6349,1300,3682,6,2011-05-11 07:35:00,1.95,13098,35
16783,2178,1939,4,2011-11-20 13:20:00,5.95,15044,35
16971,2115,2983,1,2011-11-22 12:07:00,0.83,15525,35
6080,1210,2886,12,2011-05-06 09:00:00,1.65,13952,35
17388,495,3247,5,2011-11-27 12:52:00,1.65,15351,35
18494,165,3377,1,2011-12-08 20:01:00,1.25,12748,35
17109,2597,3435,1,2011-11-23 12:40:00,1.25,16255,35
17143,1945,2352,1,2011-11-23 14:07:00,5.75,17841,35
8422,3311,2502,6,2011-06-22 10:11:00,2.95,13849,35


#### 2.2. Studying the original structure of the data

In [4]:
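# Class of each column as loaded (data is not defined yet at this point, so inspect train_raw)
sapply(train_raw, class)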

data = mutate(
    train_raw,
    InvoiceNo = factor(InvoiceNo),
    StockCode = factor(StockCode),
    Description = factor(Description),
    Quantity = as.numeric(Quantity),
    InvoiceDate = as.POSIXct(InvoiceDate),  # POSIXct is the date-time class suited to data frame columns
    UnitPrice = as.numeric(UnitPrice),
    CustomerID = factor(CustomerID),
    Country = factor(Country)
)

In [5]:
# Continue with the columns as loaded; this replaces the converted copy built in the previous cell
data = train_raw

#### 2.3. Summarizing the training set

In [6]:
summary(data)

   InvoiceNo       StockCode     Description      Quantity        
 Min.   :    0   Min.   :   0   Min.   :   0   Min.   :-80995.00  
 1st Qu.: 5069   1st Qu.: 939   1st Qu.:1141   1st Qu.:     2.00  
 Median :10310   Median :1521   Median :1987   Median :     5.00  
 Mean   : 9955   Mean   :1573   Mean   :2024   Mean   :    12.03  
 3rd Qu.:14657   3rd Qu.:2106   3rd Qu.:2945   3rd Qu.:    12.00  
 Max.   :22188   Max.   :3683   Max.   :3895   Max.   : 80995.00  
                                                                  
              InvoiceDate       UnitPrice          CustomerID   
 2011-11-28 15:54:00:   385   Min.   :    0.00   Min.   :12346  
 2011-11-14 15:27:00:   384   1st Qu.:    1.25   1st Qu.:13953  
 2011-12-05 17:17:00:   361   Median :    1.95   Median :15152  
 2011-10-31 14:09:00:   311   Mean   :    3.45   Mean   :15288  
 2011-11-23 13:39:00:   307   3rd Qu.:    3.75   3rd Qu.:16794  
 2011-09-21 14:40:00:   273   Max.   :38970.00   Max.   :18287  
 (Other) 

#### 2.5. Check for missing data

In [7]:
sapply(data, function(x){sum(is.na(x))})
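
To illustrate what this check returns, here is a minimal sketch on a small made-up data frame. The `toy` data frame and its values are hypothetical and are not taken from Train.csv. `colSums(is.na(...))` is an equivalent base R shorthand for the same count.

```r
# Hypothetical toy data, for illustration only (not from Train.csv)
toy = data.frame(
    Quantity  = c(3, NA, 4, 1),
    UnitPrice = c(3.75, 1.95, NA, NA)
)

# Same idiom as above: count the NA entries in each column
na_counts = sapply(toy, function(x){sum(is.na(x))})

# Equivalent base R shorthand
na_counts_alt = colSums(is.na(toy))

print(na_counts)
```

A column with a non-zero count would need imputation or row removal before model fitting.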

#### 2.6. Count of unique values in each field

In [8]:
sapply(data, function(x){length(unique(x))})

In [9]:
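# Correlation of each column with InvoiceNo (column 1); InvoiceNo itself and InvoiceDate (column 5) are left out
sapply(data[,-c(1,5)], function(x){round(cor(x,data[,1]),2)})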

This machine learning problem aims to predict the price of items in the store, not to forecast price trends over time, so we can drop the InvoiceDate field.

InvoiceNo is also only weakly correlated with the other columns, so it is unlikely to help the model and we drop it as well.
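
The check above measures correlation with InvoiceNo rather than with the target. A complementary check, sketched here on made-up data, computes each column's Pearson correlation with UnitPrice. The `toy` data frame, its values, and the `target` and `predictors` names are hypothetical. Keep in mind that Pearson correlation only captures linear association.

```r
# Hypothetical toy data, for illustration only (not from Train.csv)
toy = data.frame(
    InvoiceNo = c(1, 2, 3, 4, 5, 6),
    Quantity  = c(12, 6, 4, 3, 2, 1),
    UnitPrice = c(0.5, 1.0, 1.5, 2.0, 3.0, 6.0)
)

target = "UnitPrice"
predictors = setdiff(names(toy), target)

# Pearson correlation of every predictor with the target, rounded to 2 digits
cor_with_target = sapply(toy[, predictors], function(x){round(cor(x, toy[[target]]), 2)})
print(cor_with_target)
```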

In [10]:
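# Drop the two columns discussed above
data = select(data, -c(InvoiceNo, InvoiceDate))

In [11]:
# Predictor names: every remaining column except the target, UnitPrice
features = setdiff(names(data), "UnitPrice")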

In [None]:
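# Scatterplot matrix of the predictors together with the target
featurePlot(x = data[, features], 
            y = data$UnitPrice, 
            plot = "pairs")
# Options that could be passed with plot = "scatter" instead:
#             type = c("p", "smooth"),
#             span = .5,
#             layout = c(3, 2)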

In [None]:
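# Estimate each numeric column's mean and standard deviation, then center and scale.
# Note: every numeric column of data is transformed here, including the target UnitPrice.
preObj = preProcess(data, method=c("center","scale"))
data_scaled = predict(preObj,data)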
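
With `method = c("center", "scale")`, each numeric column has its mean subtracted and is then divided by its standard deviation. Below is a minimal base R sketch of that arithmetic on made-up values. The `toy` data frame and the `predictors` name are hypothetical. The sketch standardizes only the predictors and leaves UnitPrice untouched, which is one possible choice and not what the cell above does.

```r
# Hypothetical toy data, for illustration only (not from Train.csv)
toy = data.frame(
    Quantity   = c(2, 4, 6, 8),
    CustomerID = c(12346, 13953, 15152, 18287),
    UnitPrice  = c(1.25, 1.95, 3.75, 0.83)
)
predictors = setdiff(names(toy), "UnitPrice")

# Center and scale by hand: (x - mean(x)) / sd(x)
toy_scaled = toy
toy_scaled[predictors] = lapply(toy[predictors], function(x){(x - mean(x)) / sd(x)})

# scale() from base R performs the same computation
same = all.equal(toy_scaled$Quantity, as.numeric(scale(toy$Quantity)))

# Column means of the standardized predictors should be numerically close to 0
print(colMeans(toy_scaled[predictors]))
```

Keeping the target on its original scale means predictions come out directly in price units, without needing to invert the transformation.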