### Data Mining course
**Student:**    Danis Alukaev <br>
**Email:**      d.alukaev@innopolis.university <br>
**Group:**      B19-DS-01

## Table of contents

- Prerequisites
- Modeling
  - Select modeling technique
    - Recommender System
    - Logistic Optimization
    - Predicting Cancellations
    - Predicting the Total Revenue
  - Generate test design
  - Build model
  - Assess model

## Prerequisites

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib_inline
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
from mlxtend.frequent_patterns import apriori, association_rules
import random
import pprint
import warnings
import os
import nltk

%matplotlib inline
random.seed(42)

In [2]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

warnings.filterwarnings('ignore')

In [3]:
ds_dir = './data/'
ds = 'completed_orders.csv'
path = os.path.join(ds_dir, ds)

data = pd.read_csv(path)

In [4]:
data.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Total,Year,Month,Day,Weekday,Hour,Segment,Population,Wage,Holiday
0,536365,8512,white hanging heart t-light holder,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.3,2010,12,1,Wednesday,8,Can't lose them,62.026,44521,False
1,536365,7105,white metal lantern,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,2010,12,1,Wednesday,8,Can't lose them,62.026,44521,False
2,536365,8440,cream cupid hearts coat hanger,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.0,2010,12,1,Wednesday,8,Can't lose them,62.026,44521,False
3,536365,8402,knitted union flag hot water bottle,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,2010,12,1,Wednesday,8,Can't lose them,62.026,44521,False
4,536365,8402,red woolly hottie white heart.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,2010,12,1,Wednesday,8,Can't lose them,62.026,44521,False


# Modeling
---------------------
In this phase, various modeling techniques are selected and applied and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often necessary.

## Select modeling technique
----------


### Task

As the first step in modeling, select the actual modeling technique that is to be used. Whereas you possibly already selected a tool in business understanding, this task refers to the specific modeling technique, e.g., decision tree building with C4.5 or neural network generation with back propagation. If multiple techniques are applied, perform this task for each technique separately.

### Output

#### Modeling technique

Document the actual modeling technique that is to be used.

#### Modeling assumptions

Many modeling techniques make specific assumptions on the data, e.g., all attributes have uniform distributions, no missing values allowed, class attribute must be symbolic, etc. Record any such assumptions made.

----------------

My data mining goals cover several topics: a recommender system, cost optimization, prediction of cancellations, and prediction of total revenue. Notes on each of them follow.

### 1. Recommender system
As a baseline I will use association rule learning with the Apriori algorithm. It identifies items that are likely to be bought together and uses no features other than the contents of each basket. Its main limitation is cost: it spends a lot of time and memory holding a vast number of candidate sets when there are many frequent itemsets, a low minimum support, or large itemsets. One major assumption about the data is that the item names contain no misspellings.
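To give a feel for the cost limitation, here is a minimal sketch that counts how many distinct itemsets of up to a given size exist over `n` items. This is only a worst-case upper bound: Apriori prunes every candidate that has an infrequent subset, so the real number is smaller. The helper name `candidate_count` is hypothetical.

```python
from math import comb


def candidate_count(n_items: int, max_len: int) -> int:
    """Number of distinct non-empty itemsets of size 1..max_len over n_items items."""
    return sum(comb(n_items, k) for k in range(1, max_len + 1))


# The bound grows quickly with the number of distinct items.
for n in (10, 100, 1000):
    print(n, candidate_count(n, 3))
```

Lowering the minimum support lets more itemsets survive pruning, which pushes the actual candidate count closer to this bound.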

The first option is to use a neural network to predict the contents of the cart from the customer ID, date/time, country, segment, and wage. The data should be balanced (no bias). The categorical features should also be encoded, and a scaler should be applied to all features.
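The encoding and scaling step could look roughly like the sketch below. It assumes one-hot encoding for categorical features and z-score standardization for numeric ones; neither choice is fixed by the notes above, and the helpers `one_hot` and `standardize` are hypothetical names written in plain Python for illustration.

```python
from statistics import mean, pstdev


def one_hot(values):
    """Map each categorical value to a 0/1 vector over the sorted set of categories."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]


def standardize(values):
    """Scale numeric values to zero mean and unit variance (z-score)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Toy rows reusing country and wage values visible in the data preview.
countries = ["France", "United Kingdom", "France"]
wages = [41548, 44521, 41548]

categories, country_vectors = one_hot(countries)
wage_scaled = standardize(wages)

# Final feature rows: one-hot country columns followed by the scaled wage.
features = [vec + [w] for vec, w in zip(country_vectors, wage_scaled)]
```

In practice the scaler statistics should be computed on the training split only and reused for the test and validation splits.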

[?] The second option is an external recommender system based on DBSCAN. For clustering we use only the customer ID and the chosen items. As a result, each customer is associated with a cluster whose members have a set of items in common. These items will be our suggestion. The idea might be developed further by computing the distance between a new user and the existing ones: the purchases of the closest user will be used for the suggestion.
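The "closest user" variant of this idea can be sketched as follows. Note that this is not DBSCAN itself: it is a plain nearest-neighbour lookup, and the Jaccard distance over purchased item sets is an assumption, since the notes do not fix a distance measure. The function names and the toy purchase histories are hypothetical.

```python
def jaccard_distance(a: set, b: set) -> float:
    """1 - |a & b| / |a | b|; 0 means identical baskets, 1 means disjoint ones."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def suggest_from_nearest(new_items: set, history: dict) -> set:
    """Return items bought by the closest known customer that the new user lacks."""
    nearest = min(history, key=lambda cid: jaccard_distance(new_items, history[cid]))
    return history[nearest] - new_items


# Hypothetical purchase histories keyed by customer ID.
history = {
    12583: {"alarm clock bakelike pink", "alarm clock bakelike red", "stars gift tape"},
    12650: {"tea party birthday card", "spaceboy birthday card"},
}

suggestion = suggest_from_nearest({"alarm clock bakelike red"}, history)
```

The same distance function could later serve as the metric for a density-based clustering step, if that route is taken.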

### 2. Logistic optimization

This can be solved using multi-output regression, and the baseline will be a Multi-Layer Perceptron (MLP). The inputs are the date, country, population, wage, and number of holidays. The output is the expected number of items to be purchased.

### 3. Predicting cancellations

This is a binary classification problem that can be solved with a neural network. To determine whether an order will be cancelled, we can use features related to the products purchased, country, date, segment, and whether it is a holiday.

### 4. Predicting the total revenue

This task relates to sequence-to-sequence models, so Pr. Beklaryan said that we will try something in this field in the next labs.

## Generate test design
------------------




### Task

Before we actually build a model, we need to generate a procedure or mechanism to test the model's quality and validity. For example, in supervised data mining tasks such as classification, it is common to use error rates as quality measures for data mining models. Therefore, we typically separate the dataset into train and test set, build the model on the train set and estimate its quality on the separate test set.

### Output

Describe the intended plan for training, testing and evaluating the models. A primary component of the plan is to decide how to divide the available dataset into training data, test data and validation datasets.

-----------------------

### 1. Recommender System

Recommender systems are a tricky field that cannot be properly tested without external validation. Frankly speaking, the best way is to design an A/B test and check how the new approach influences satisfaction with the suggestions, the average cart, the number of clicks on suggested items, etc.
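One possible way to evaluate such an A/B test on click counts is a two-proportion z-test, sketched below. The choice of test, the function name, and the click and view numbers are all assumptions made for illustration; the notes above do not prescribe a statistical procedure.

```python
from math import erf, sqrt


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Two-sided z-test for a difference between two click-through rates."""
    p_a, p_b = clicks_a / views_a, clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    if se == 0:
        raise ValueError("test is undefined when the pooled rate is 0 or 1")
    z = (p_b - p_a) / se
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - cdf)


# Hypothetical counts: group A sees the old suggestions, group B the new ones.
z, p_value = two_proportion_z_test(clicks_a=100, views_a=1000, clicks_b=130, views_b=1000)
```

A small p-value would suggest that the difference in click-through rate between the two groups is unlikely to be due to chance alone.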

Traditional metrics such as MAE and MSE can still be used, but unfortunately we cannot relate them to our data. The problem is that we can only compare our suggestions with the items a person bought, and these are the items that the customer would buy anyway. The solution will be to create infrastructure for testing in step 6.

### 2. Logistic Optimization
As the regression metric I would use Root-Mean-Squared Error (RMSE). The data is to be split into 70% for training, 20% for testing, and 10% for validation. The optimizer will be AdamW with a One Cycle LR scheduler.
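The split and the metric can be sketched in plain Python as follows. The helper names `split_indices` and `rmse` are hypothetical, and the random shuffle is an assumption: for strongly time-dependent data a chronological split may be more appropriate.

```python
import random
from math import sqrt


def split_indices(n, train=0.7, test=0.2, seed=42):
    """Shuffle indices 0..n-1 and cut them into train, test and validation parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = round(n * train)
    n_test = round(n * test)
    # Whatever remains after train and test becomes the validation part.
    return indices[:n_train], indices[n_train:n_train + n_test], indices[n_train + n_test:]


def rmse(y_true, y_pred):
    """Root-mean-squared error between two equal-length sequences."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


train_idx, test_idx, val_idx = split_indices(100)
```

The index lists can then be passed to `DataFrame.iloc` to obtain the three subsets.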

### 3. Predicting Cancellations
Since this is a binary classification problem, the metric might be the F1-score (which combines precision and recall). The data is to be split into 70% for training, 20% for testing, and 10% for validation. The optimizer will be AdamW with a One Cycle LR scheduler.
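For reference, a minimal sketch of how the F1-score follows from precision and recall. It assumes that label 1 marks a cancelled order; the function name and the toy labels are hypothetical.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1), computed from precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: 1 = cancelled, 0 = completed.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
score = f1_score(y_true, y_pred)
```

F1 is preferable to plain accuracy here if cancellations are rare, because it ignores the (large) number of true negatives.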

### 4. Predicting the total revenue
TBD later in the course. From my experience, though: the Pearson coefficient as the metric, and a data split of 80% for training, 10% for testing, and 10% for validation.
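As a reminder of what the Pearson coefficient measures, here is a small sketch that compares a predicted series with the actual one. The function name and the revenue numbers are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)


# Hypothetical actual vs. predicted revenue for four periods.
actual = [100.0, 120.0, 90.0, 150.0]
predicted = [110.0, 125.0, 95.0, 140.0]
r = pearson(actual, predicted)
```

A value close to 1 means the predictions follow the shape of the actual series, although it says nothing about a constant offset or scale error.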

## Build model
----------



### Task

Run the modeling tool on the prepared dataset to create one or more models.

### Output

#### Parameter settings 

With any modeling tool, there are often a large number of parameters that can be adjusted. List the parameters and their chosen value, along with the rationale for the choice of parameter settings. 

#### Models 

These are the actual models produced by the modeling tool, not a report.

#### Model description

Describe the resultant model. Report on the interpretation of the models and document any difficulties encountered with their meanings.

--------------

### 1. Recommender system

#### Association Rules

In [5]:
df = data.copy()

The recommender system will be different for each country; consider building one for customers in France. The bigger countries will occupy much more memory (see the limitations in Select modeling technique).

In [19]:
# Despite its name, df_uk holds the France subset.
df_uk = df[df.Country == 'France']

In [20]:
df_uk

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Total,Year,Month,Day,Weekday,Hour,Segment,Population,Wage,Holiday
478837,536370,2272,alarm clock bakelike pink,24,2010-12-01 08:45:00,3.75,12583,France,90.00,2010,12,1,Wednesday,8,Champions,62.791,41548,False
478838,536370,2272,alarm clock bakelike red,24,2010-12-01 08:45:00,3.75,12583,France,90.00,2010,12,1,Wednesday,8,Champions,62.791,41548,False
478839,536370,2272,alarm clock bakelike green,12,2010-12-01 08:45:00,3.75,12583,France,45.00,2010,12,1,Wednesday,8,Champions,62.791,41548,False
478840,536370,2172,panda and bunnies sticker sheet,12,2010-12-01 08:45:00,0.85,12583,France,10.20,2010,12,1,Wednesday,8,Champions,62.791,41548,False
478841,536370,2188,stars gift tape,24,2010-12-01 08:45:00,0.65,12583,France,15.60,2010,12,1,Wednesday,8,Champions,62.791,41548,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
486917,580986,2202,rainy ladies birthday card,12,2011-12-06 16:34:00,0.42,12650,France,5.04,2011,12,6,Tuesday,16,Recent Customers,62.791,41548,False
486918,580986,2202,tea party birthday card,12,2011-12-06 16:34:00,0.42,12650,France,5.04,2011,12,6,Tuesday,16,Recent Customers,62.791,41548,False
486919,580986,2202,spaceboy birthday card,12,2011-12-06 16:34:00,0.42,12650,France,5.04,2011,12,6,Tuesday,16,Recent Customers,62.791,41548,False
486920,580986,2357,snack tray i love london,8,2011-12-06 16:34:00,1.95,12650,France,15.60,2011,12,6,Tuesday,16,Recent Customers,62.791,41548,False


In [21]:
# Basket matrix: one row per invoice, one column per item, 1 if the item was bought.
df_items = df_uk.groupby(['InvoiceNo', 'Description'])['Quantity'].sum().unstack().fillna(0).applymap(lambda x: 1 if x > 0 else 0)

The algorithm is characterized by three main measures:
1. Support - the fraction of transactions that contain an itemset, i.e. the probability of the itemset occurring.
2. Confidence - the conditional probability of the consequent given the antecedent.
3. Lift - the probability of all items occurring together divided by the product of the probabilities of the antecedent and the consequent, i.e. the value expected if they were independent of each other.
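The three measures can be illustrated on a toy set of baskets. This is a minimal sketch with hypothetical item names and helper functions, independent of the `mlxtend` implementation used in this notebook.

```python
# Four toy transactions (hypothetical items).
baskets = [
    {"beaker", "bowl"},
    {"beaker", "bowl", "cup"},
    {"bowl"},
    {"cup"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item of the itemset."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent)."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


def lift(antecedent, consequent, baskets):
    """Joint support divided by the product of the individual supports."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)


s = support({"beaker", "bowl"}, baskets)
c = confidence({"beaker"}, {"bowl"}, baskets)
l = lift({"beaker"}, {"bowl"}, baskets)
```

The same relations hold in the rules table of this notebook: for rule 0, confidence 0.555556 equals support 0.013089 divided by antecedent support 0.023560, and lift 8.162393 equals that confidence divided by consequent support 0.068063.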

I found it essential to constrain the support to at least 1% (for the environment given by the data).

In [22]:
item_set = apriori(df_items, min_support=0.01, use_colnames=True)

Unnamed: 0,support,itemsets
0,0.02356,( dolly girl beaker)
1,0.013089,( i love london mini backpack)
2,0.018325,( set 2 tea towels i love london )
3,0.041885,( spaceboy baby gift set)
4,0.031414,(10 colour spaceboy pen)


In [25]:
item_set.sort_values(by='support').head()

Unnamed: 0,support,itemsets
14907,0.010471,"(lunch bag spaceboy design , lunch box with cu..."
19858,0.010471,"(alarm clock bakelike green, skull lunch box w..."
19857,0.010471,"(alarm clock bakelike green, plasters in tin c..."
19855,0.010471,"(alarm clock bakelike green, skull lunch box w..."
19854,0.010471,"(alarm clock bakelike green, skull lunch box w..."
...,...,...
401,0.162304,(round snack boxes set of4 woodland )
316,0.172775,(plasters in tin circus parade )
321,0.175393,(plasters in tin woodland animals)
371,0.185864,(red toadstool led night light)


As you can see, another great thing about this approach is that it also retrieves single popular items.

In [24]:
rules = association_rules(item_set, metric="support", min_threshold=0.01)
rules.sort_values(by='antecedent support').head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,( dolly girl beaker),(charlotte bag dolly girl design),0.023560,0.068063,0.013089,0.555556,8.162393,0.011485,2.096859
1,(charlotte bag dolly girl design),( dolly girl beaker),0.068063,0.023560,0.013089,0.192308,8.162393,0.011485,1.208925
2,( dolly girl beaker),(dolly girl childrens bowl),0.023560,0.047120,0.018325,0.777778,16.506173,0.017214,4.287958
3,(dolly girl childrens bowl),( dolly girl beaker),0.047120,0.023560,0.018325,0.388889,16.506173,0.017214,1.597811
4,( dolly girl beaker),(dolly girl childrens cup),0.023560,0.041885,0.015707,0.666667,15.916667,0.014720,2.874346
...,...,...,...,...,...,...,...,...,...
1139817,(red retrospot mini cases),"(jumbo bag apples, lunch box i love london, al...",0.141361,0.010471,0.010471,0.074074,7.074074,0.008991,1.068691
1139818,(lunch bag apple design),"(jumbo bag apples, lunch box i love london, al...",0.128272,0.010471,0.010471,0.081633,7.795918,0.009128,1.077487
1139819,(alarm clock bakelike red ),"(jumbo bag apples, lunch box i love london, al...",0.096859,0.010471,0.010471,0.108108,10.324324,0.009457,1.109472
1139820,(childrens cutlery dolly girl ),"(jumbo bag apples, lunch box i love london, al...",0.073298,0.010471,0.010471,0.142857,13.642857,0.009704,1.154450
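
Such rules can be turned into suggestions for a given cart. The sketch below is one possible approach, not the method fixed by this report: it takes every rule whose antecedents are all in the cart and ranks the consequents by lift. The function name is hypothetical, the rules are copied from the first rows of the table above, and the item names are stripped of the stray leading space.

```python
def recommend(cart, rules, top_n=3):
    """Return consequents of rules whose antecedents are all in the cart, ranked by lift."""
    cart = set(cart)
    scores = {}
    for rule in rules:
        if rule["antecedents"] <= cart:
            # Skip items the customer already has; keep the best lift per item.
            for item in rule["consequents"] - cart:
                scores[item] = max(scores.get(item, 0.0), rule["lift"])
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


rules_sample = [
    {"antecedents": {"dolly girl beaker"},
     "consequents": {"charlotte bag dolly girl design"}, "lift": 8.162393},
    {"antecedents": {"dolly girl beaker"},
     "consequents": {"dolly girl childrens bowl"}, "lift": 16.506173},
    {"antecedents": {"dolly girl beaker"},
     "consequents": {"dolly girl childrens cup"}, "lift": 15.916667},
]

suggestions = recommend({"dolly girl beaker"}, rules_sample)
```

With the real `rules` DataFrame, the same filtering could be done on the `antecedents` and `consequents` columns, which hold frozensets.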


## Assess model
-------------



### Task

The data mining engineer interprets the models according to his domain knowledge, the data mining success criteria and the desired test design. This task interferes with the subsequent evaluation phase. Whereas the data mining engineer judges the success of the application of modeling and discovery techniques more technically, he contacts business analysts and domain experts later in order to discuss the data mining results in the business context. Moreover, this task only considers models whereas the evaluation phase also takes into account all other results that were produced in the course of the project. The data mining engineer tries to rank the models. He assesses the models according to the evaluation criteria. As far as possible he also takes into account business objectives and business success criteria. In most data mining projects, the data mining engineer applies a single technique more than once or generates data mining results with different alternative techniques. In this task, he also compares all results according to the evaluation criteria.

### Output

#### Model assessment

Summarize results of this task, list qualities of generated models (e.g., in terms of accuracy) and rank their quality in relation to each other.

#### Revised parameter settings

According to the model assessment, revise parameter settings and tune them for the next run in the Build Model task. Iterate model building and assessment until you strongly believe that you found the best model(s). Document all such revisions and assessments.

-----------------

### 1. Recommender system

As mentioned earlier, testing the recommender system on the given data will not be trustworthy because of the lack of external validation. In step 6 I plan to build an A/B testing infrastructure to assess the effectiveness of the proposed technique.