### Data Mining course
**Student:**    Danis Alukaev <br>
**Email:**      d.alukaev@innopolis.university <br>
**Group:**      B19-DS-01

## Table of contents

- Prerequisites
- Modeling
  - Select modeling technique
  - Generate test design
  - Build model
  - Assess model

## Prerequisites

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib_inline
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
import random
import pprint
import warnings
import os
import nltk

%matplotlib inline
random.seed(42)

In [2]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

warnings.filterwarnings('ignore')

In [3]:
ds_dir = './data/'
ds = 'completed_orders.csv'
path = os.path.join(ds_dir, ds)

data = pd.read_csv(path)

In [4]:
data.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Total,Year,Month,Day,Weekday,Hour,Segment,Population,Wage,Holiday
0,536365,8512,white hanging heart t-light holder,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.3,2010,12,1,Wednesday,8,Can't lose them,62.026,44521,False
1,536365,7105,white metal lantern,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,2010,12,1,Wednesday,8,Can't lose them,62.026,44521,False
2,536365,8440,cream cupid hearts coat hanger,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.0,2010,12,1,Wednesday,8,Can't lose them,62.026,44521,False
3,536365,8402,knitted union flag hot water bottle,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,2010,12,1,Wednesday,8,Can't lose them,62.026,44521,False
4,536365,8402,red woolly hottie white heart.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,2010,12,1,Wednesday,8,Can't lose them,62.026,44521,False


# Modeling
---------------------
In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of the data, so stepping back to the data preparation phase is often necessary.

## Select modeling technique
----------


### Task

As the first step in modeling, select the actual modeling technique that is to be used. Whereas you possibly already selected a tool in business understanding, this task refers to the specific modeling technique, e.g., decision tree building with C4.5 or neural network generation with backpropagation. If multiple techniques are applied, perform this task for each technique separately.

### Output

#### Modeling technique

Document the actual modeling technique that is to be used.

#### Modeling assumptions

Many modeling techniques make specific assumptions about the data, e.g., all attributes have uniform distributions, no missing values are allowed, the class attribute must be symbolic, etc. Record any such assumptions made.

----------------

### 1. Recommender system
Baseline: Association rule learning, which determines the items that are likely to be bought together.

Option 1: Neural network that predicts the contents of the cart from features.

Option 2: External recommender system based on clustering. Each customer is assigned to a cluster of customers who share a common set of items, and the items bought most frequently within that cluster are suggested to the user.
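The association-rule baseline can be illustrated with a minimal sketch that only mines rules between pairs of items. It uses the standard library only; the function name `pair_rules`, the thresholds, and the toy baskets are hypothetical and stand in for the item sets that would be obtained by grouping `Description` by `InvoiceNo`.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.3, min_confidence=0.6):
    """Return (antecedent, consequent, support, confidence, lift) for item pairs."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # de-duplicate items within one invoice
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]       # estimate of P(y | x)
            lift = confidence / (item_counts[y] / n)  # P(y | x) / P(y)
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence, lift))
    # Strongest associations (highest lift) first
    return sorted(rules, key=lambda r: r[4], reverse=True)


# Toy baskets (hypothetical data, not taken from completed_orders.csv)
baskets = [
    ["lantern", "holder"],
    ["lantern", "holder", "hanger"],
    ["lantern", "bottle"],
    ["holder", "hanger"],
]
rules = pair_rules(baskets)
for antecedent, consequent, support, confidence, lift in rules:
    print(f"{antecedent} -> {consequent}: "
          f"support={support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```

A full baseline would also consider larger item sets (for example with the Apriori algorithm); the pairwise version is kept here only to make the support, confidence, and lift definitions concrete.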

### 2. Logistics optimization

Field: Regression.

Predicts the number of items expected to be sold by the end of the month.


Baseline: Random Forest Regressor. 
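A rough sketch of how such a baseline might be fitted is shown below. The features (month and average unit price), the synthetic target, and all variable names are assumptions made for illustration; the real features would come from aggregating the order data per product and month.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(42)

# Hypothetical per-product monthly features: month number and average unit price
n_samples = 200
month = rng.integers(1, 13, size=n_samples)
unit_price = rng.uniform(0.5, 10.0, size=n_samples)
X_demo = np.column_stack([month, unit_price])

# Synthetic target: monthly quantity sold, with noise (illustration only)
y_demo = 100 + 5 * month - 8 * unit_price + rng.normal(0, 5, size=n_samples)

# Simple hold-out: first 150 rows for fitting, last 50 for evaluation
X_fit, X_holdout = X_demo[:150], X_demo[150:]
y_fit, y_holdout = y_demo[:150], y_demo[150:]

reg = RandomForestRegressor(n_estimators=100, random_state=42)
reg.fit(X_fit, y_fit)

holdout_pred = reg.predict(X_holdout)
mae = mean_absolute_error(y_holdout, holdout_pred)
print(f"MAE on hold-out: {mae:.2f} items")
```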

### 3. Predicting cancellations

Field: Classification.

### 4. Predicting the total revenue

Field: sequence models.

## Generate test design
------------------




### Task

Before we actually build a model, we need to generate a procedure or mechanism to test the model's quality and validity. For example, in supervised data mining tasks such as classification, it is common to use error rates as quality measures for data mining models. Therefore, we typically separate the dataset into train and test set, build the model on the train set and estimate its quality on the separate test set.

### Output

Describe the intended plan for training, testing and evaluating the models. A primary component of the plan is to decide how to divide the available dataset into training data, test data and validation datasets.
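One possible way to obtain training, validation, and test sets is to call scikit-learn's `train_test_split` twice. This is a minimal sketch on synthetic data; the 60/20/20 proportions, the stratification, and the names `X_demo` and `y_demo` are assumptions rather than choices made in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and an imbalanced binary target
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 80 + [1] * 20)

# Step 1: hold out 20% of the rows as the final test set
X_trainval, X_te, y_trainval, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Step 2: split the remaining 80% into 60% train and 20% validation
# (0.25 of the remaining 80% equals 20% of the full dataset)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42
)

print(len(X_tr), len(X_val), len(X_te))
```

For the revenue forecasting task, a chronological split on `InvoiceDate` would probably be more appropriate than a random one, since a random split lets the model see orders that are later than those it is evaluated on.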

-----------------------

## Build model
----------



### Task

Run the modeling tool on the prepared dataset to create one or more models.

### Output

#### Parameter settings 

With any modeling tool, there are often a large number of parameters that can be adjusted. List the parameters and their chosen value, along with the rationale for the choice of parameter settings. 
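As an illustration of how parameter settings and their rationale could be recorded, the sketch below runs a small grid search and keeps the chosen values together with the cross-validated score that justified them. The grid, the synthetic data from `make_classification`, and the variable names are assumptions for demonstration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the prepared training data
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# Candidate values for each parameter (hypothetical grid)
param_grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}

search = GridSearchCV(SVC(), param_grid, cv=3, scoring="accuracy")
search.fit(X_demo, y_demo)

# Chosen settings plus the evidence behind the choice
chosen = search.best_params_
score = search.best_score_
print(f"chosen parameters: {chosen}, mean CV accuracy: {score:.3f}")
```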

#### Models 

These are the actual models produced by the modeling tool, not a report.

#### Model description

Describe the resultant model. Report on the interpretation of the models and document any difficulties encountered with their meanings.

--------------

In [None]:
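# split data (X and y are assumed to be the prepared feature matrix and target)
from sklearn.model_selection import train_test_split

## closed test: evaluate on the training data itself (optimistic, sanity check only)
# X_train, X_test, y_train, y_test = X, X, y, y

## random split: holds out 25% of the rows by default; seeded for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)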

### Regression
-------------

### Classification
------------

#### Imbalanced data

In [None]:
# imbalanced data: https://www.analyticsvidhya.com/blog/2020/07/10-techniques-to-deal-with-class-imbalance-in-machine-learning/
# Choose ONE sampler; the alternatives are left commented out.
from imblearn.under_sampling import RandomUnderSampler, TomekLinks, NearMiss
from imblearn.over_sampling import RandomOverSampler, SMOTE

## random under sampler
# sampler = RandomUnderSampler(random_state=42)

## random over sampler
# sampler = RandomOverSampler(random_state=42)

## SMOTE
# sampler = SMOTE(random_state=42)

## TomekLinks
# sampler = TomekLinks()

## NearMiss (the sampler that was effectively active, being the last one assigned)
sampler = NearMiss()

## sampling: resample the training set only, never the test set
X_train, y_train = sampler.fit_resample(X_train, y_train)

#### Modeling

In [None]:
# SVM
from sklearn.svm import SVC
model = SVC(kernel='linear', random_state=None)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

In [None]:
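# LightGBM with Optuna's stepwise hyperparameter tuner
# (recent Optuna releases may also need the separate optuna-integration package)
!pip install optuna lightgbm
import optuna.integration.lightgbm as lgb
lgb_train = lgb.Dataset(X_train, y_train)
# note: tuning against the test set leaks it into model selection; prefer a separate validation split
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
params = {
    "objective" : "multiclass",
    "metric" : "multi_logloss",
    "num_class" : y.nunique()
}
model = lgb.train(params, lgb_train, valid_sets=lgb_eval)
# predict() returns one probability per class; take the most likely class
y_prob = model.predict(X_test, num_iteration=model.best_iteration)
y_pred = np.argmax(y_prob, axis=1)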

## Assess model
-------------



### Task

The data mining engineer interprets the models according to his domain knowledge, the data mining success criteria and the desired test design. This task interferes with the subsequent evaluation phase. Whereas the data mining engineer judges the success of the application of modeling and discovery techniques more technically, he contacts business analysts and domain experts later in order to discuss the data mining results in the business context. Moreover, this task only considers models whereas the evaluation phase also takes into account all other results that were produced in the course of the project. The data mining engineer tries to rank the models. He assesses the models according to the evaluation criteria. As far as possible he also takes into account business objectives and business success criteria. In most data mining projects, the data mining engineer applies a single technique more than once or generates data mining results with different alternative techniques. In this task, he also compares all results according to the evaluation criteria.

### Output

#### Model assessment

Summarize the results of this task, list the qualities of the generated models (e.g., in terms of accuracy), and rank their quality in relation to each other.
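A ranking of this kind could be produced, for example, by scoring each candidate with the same cross-validation procedure and sorting by the mean score. The sketch below assumes two arbitrary candidate classifiers and synthetic data; neither is prescribed by this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# Hypothetical candidate models to compare
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Mean 5-fold cross-validated accuracy per model
scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

# Best model first
ranking = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for rank, (name, score) in enumerate(ranking, start=1):
    print(f"{rank}. {name}: {score:.3f}")
```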

#### Revised parameter settings

According to the model assessment, revise parameter settings and tune them for the next run in the Build Model task. Iterate model building and assessment until you strongly believe that you found the best model(s). Document all such revisions and assessments.

-----------------

### Regression
-----------------

### Classification
----------------


In [None]:
# accuracy
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print('{:.1f}%'.format(accuracy * 100))  # fixed-point with one decimal; '.1g' would keep only one significant digit

In [None]:
# confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
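Per-class precision, recall, and F1 can be read directly off a confusion matrix. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes) and uses a made-up 3-class matrix named `cm_demo` instead of real results.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class
cm_demo = np.array([
    [50, 2, 3],
    [4, 40, 6],
    [1, 5, 39],
])

true_positives = np.diag(cm_demo)

# Column sums count how often each class was predicted
precision = true_positives / cm_demo.sum(axis=0)
# Row sums count how often each class actually occurred
recall = true_positives / cm_demo.sum(axis=1)

f1 = 2 * precision * recall / (precision + recall)
macro_f1 = f1.mean()  # unweighted average over classes

print("precision:", np.round(precision, 3))
print("recall:   ", np.round(recall, 3))
print("macro F1: ", round(float(macro_f1), 3))
```

Macro-averaged F1 weights every class equally, which is usually more informative than plain accuracy when the classes are imbalanced.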

#### Binary classification

#### Multi-class classification