In [1]:
# # mount drive
# from google.colab import drive
# drive.mount('/content/drive')

In [2]:
# # move directory
# import os
# colab_dir = "./drive/MyDrive/"
# os.chdir(colab_dir)

In [2]:
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib_inline
%matplotlib inline

In [3]:
# # import python module
# from python.xxxx import XXXX

In [4]:
# set random seeds (Python and NumPy) for reproducibility
import random
random.seed(335)
np.random.seed(335)

In [5]:
# show up to 500 rows and 500 columns when displaying DataFrames
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

In [6]:
# pretty-printing helper; suppress warnings to keep cell output readable
import pprint
import warnings
warnings.filterwarnings('ignore')

### reference
-------------------

- [pandas cheat sheet](https://github.com/pandas-dev/pandas/tree/master/doc/cheatsheet)
- [numpy cheat sheet(data camp)](https://www.datacamp.com/community/blog/python-numpy-cheat-sheet)
- [scikit-learn cheat sheet(data camp)](https://www.datacamp.com/community/blog/scikit-learn-cheat-sheet)

# modeling
---------------------
In this phase, various modeling techniques are selected and applied and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often necessary.

## select modeling technique
----------

### task

As the first step in modeling, select the actual modeling technique that is to be used. Whereas you possibly already selected a tool in business understanding, this task refers to the specific modeling technique, e.g., decision tree building with C4.5 or neural network generation with backpropagation. If multiple techniques are applied, perform this task for each technique separately.

### output

#### modeling technique

Document the actual modeling technique that is to be used.

#### modeling assumptions

Many modeling techniques make specific assumptions on the data, e.g., all attributes have uniform distributions, no missing values allowed, class attribute must be symbolic, etc. Record any such assumptions made.
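An assumption such as "no missing values allowed" can be checked directly on the data before it is recorded. The following is a minimal sketch on a made-up frame; `toy` and its values are hypothetical and are not the project data.

```python
import pandas as pd

# Hypothetical toy frame standing in for a prepared dataset.
toy = pd.DataFrame({
    "Quantity": [6, 8, None, 4],
    "UnitPrice": [2.55, 2.75, 3.39, 4.15],
})

# Count missing values per column; most scikit-learn estimators reject NaN.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Columns that violate a "no missing values" assumption.
violations = missing_per_column[missing_per_column > 0].index.tolist()
print(violations)
```

Any column listed in `violations` would need imputation or row removal in the data preparation phase before a model with this assumption is fitted.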

## generate test design
------------------

### task

Before we actually build a model, we need to generate a procedure or mechanism to test the model's quality and validity. For example, in supervised data mining tasks such as classification, it is common to use error rates as quality measures for data mining models. Therefore, we typically separate the dataset into train and test set, build the model on the train set and estimate its quality on the separate test set.

### output

Describe the intended plan for training, testing and evaluating the models. A primary component of the plan is to decide how to divide the available dataset into training data, test data and validation datasets.
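One common way to obtain training, validation and test sets is to call `train_test_split` twice. The sketch below uses synthetic data; the names `X_demo`, `y_demo` and the 60/20/20 proportions are assumptions chosen for illustration, not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# First hold out 20% of the rows as the final test set.
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
# Then split the remainder 75/25: 60% train and 20% validation overall.
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)
print(len(X_train_demo), len(X_val_demo), len(X_test_demo))
```

The validation set is used while comparing or tuning models; the test set is touched only once, for the final quality estimate.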


## build model
----------

### task

Run the modeling tool on the prepared dataset to create one or more models.

### output

#### parameter settings 

With any modeling tool, there are often a large number of parameters that can be adjusted. List the parameters and their chosen value, along with the rationale for the choice of parameter settings. 
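For scikit-learn estimators, the parameters and their current values can be listed with `get_params()`, which gives a starting point for this documentation. A minimal sketch, assuming a `LinearRegression` estimator (`model_demo` is a hypothetical name):

```python
from sklearn.linear_model import LinearRegression

# Hypothetical estimator; fit_intercept is set explicitly for illustration.
model_demo = LinearRegression(fit_intercept=True)

# get_params() returns a dict mapping parameter names to their current values.
params = model_demo.get_params()
for name, value in sorted(params.items()):
    print(f"{name} = {value!r}")
```

The printed list can be pasted into the notes together with the rationale for each value that differs from the default.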

#### models 

These are the actual models produced by the modeling tool, not a report.

#### model description

Describe the resultant model. Report on the interpretation of the models and document any difficulties encountered with their meanings.


In [96]:
df = pd.read_csv("data/input.xlsx")
df

Unnamed: 0,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,sum,time,Australia,Austria,Bahrain,Belgium,Brazil,Canada,Channel Islands,Cyprus,Czech Republic,Denmark,EIRE,European Community,Finland,France,Germany,Greece,Hong Kong,Iceland,Israel,Italy,Japan,Lebanon,Lithuania,Malta,Netherlands,Norway,Poland,Portugal,RSA,Saudi Arabia,Singapore,Spain,Sweden,Switzerland,USA,United Arab Emirates,United Kingdom
0,WHITE HANGING HEART T-LIGHT HOLDER,6,734107,2.55,17850.0,15.30,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,WHITE METAL LANTERN,6,734107,3.39,17850.0,20.34,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,CREAM CUPID HEARTS COAT HANGER,8,734107,2.75,17850.0,22.00,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,KNITTED UNION FLAG HOT WATER BOTTLE,6,734107,3.39,17850.0,20.34,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
4,RED WOOLLY HOTTIE WHITE HEART.,6,734107,3.39,17850.0,20.34,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
343513,PACK OF 20 SPACEBOY NAPKINS,12,734480,0.85,12680.0,10.20,12,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
343514,CHILDREN'S APRON DOLLY GIRL,6,734480,2.10,12680.0,12.60,12,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
343515,CHILDRENS CUTLERY DOLLY GIRL,4,734480,4.15,12680.0,16.60,12,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
343516,CHILDRENS CUTLERY CIRCUS PARADE,4,734480,4.15,12680.0,16.60,12,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [97]:
X = df.drop(["sum", "Description", 'InvoiceDate', 'CustomerID'], axis=1)
y = df["sum"]
X

Unnamed: 0,Quantity,UnitPrice,time,Australia,Austria,Bahrain,Belgium,Brazil,Canada,Channel Islands,Cyprus,Czech Republic,Denmark,EIRE,European Community,Finland,France,Germany,Greece,Hong Kong,Iceland,Israel,Italy,Japan,Lebanon,Lithuania,Malta,Netherlands,Norway,Poland,Portugal,RSA,Saudi Arabia,Singapore,Spain,Sweden,Switzerland,USA,United Arab Emirates,United Kingdom
0,6,2.55,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,6,3.39,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,8,2.75,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,6,3.39,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
4,6,3.39,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
343513,12,0.85,12,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
343514,6,2.10,12,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
343515,4,4.15,12,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
343516,4,4.15,12,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [98]:
# split data
## random split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.3)

### regression
-------------

In [99]:
from sklearn.linear_model import LinearRegression
regression = LinearRegression().fit(X_train, y_train)

In [100]:
for name, coef in zip(X_train.columns, regression.coef_):
    print(f"type {name} has coef {coef}")

# The first three columns are Quantity, UnitPrice and time; the country dummies start at index 3.
country_coefs = list(zip(X_train.columns[3:], regression.coef_[3:]))

print("\nThese countries have negative coefficients:")
for name, coef in country_coefs:
    if coef < 0:
        print(name)
# Interpretation: these countries are considered profitable, because a negative coefficient means the other features (quantity, price, time) matter more.
print("\nThese countries have coefficients larger than 5:")
for name, coef in country_coefs:
    if coef > 5:
        print(name)
# Interpretation: selling there is considered unprofitable, so the organization should close stores there; logistics for these countries are also costly.

type Quantity has coef 1.6074041248548068
type UnitPrice has coef 3.9920328359728723
type time has coef -0.1716432177292786
type Australia has coef 21.866474828325025
type Austria has coef -0.8563858887515415
type Bahrain has coef -10.717842428475937
type Belgium has coef -4.471186335687551
type Brazil has coef 5.612729517260292
type Canada has coef -4.627432603372659
type Channel Islands has coef -3.7400019363327455
type Cyprus has coef -4.423242227195643
type Czech Republic has coef -7.581043681361836
type Denmark has coef 14.370350932822518
type EIRE has coef 1.89330274033415
type European Community has coef -6.249520993221812
type Finland has coef 1.8161069704187431
type France has coef -2.7470939729040444
type Germany has coef -1.3605090718377029
type Greece has coef 6.143403794674936
type Hong Kong has coef 0.0
type Iceland has coef -1.3566439921861386
type Israel has coef -1.458085229392896
type Italy has coef -4.805514556214717
type Japan has coef 23.061715898151185
type Lebano
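
A long coefficient listing is easier to read when each coefficient is labeled with its column name and sorted. The sketch below shows one way to do that with a `pandas.Series`; the data is synthetic and the names `features`, `target` and `fitted` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical features; the target is built as 2*a - 3*b + 1 with no noise.
rng = np.random.default_rng(0)
features = pd.DataFrame({"a": rng.normal(size=50), "b": rng.normal(size=50)})
target = 2 * features["a"] - 3 * features["b"] + 1

fitted = LinearRegression().fit(features, target)

# Label each coefficient with its column name and sort ascending.
coefs = pd.Series(fitted.coef_, index=features.columns).sort_values()
print(coefs)

# Names of the features with negative coefficients.
print(coefs[coefs < 0].index.tolist())
```

Selecting by label (for example `coefs[coefs < 0]`) avoids positional slicing, which is easy to get off by one when the column order changes.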

In [101]:
df.groupby('time')["sum"].sum()
# hour 12 has the highest total sales, so 12:00 is the best time to sell

time
6           4.25
7       29480.30
8      261659.64
9      781000.94
10    1173879.27
11    1016041.49
12    1265369.19
13    1060944.28
14     911553.22
15     847506.63
16     430621.88
17     212435.00
18      99479.11
19      43851.36
20      17994.11
Name: sum, dtype: float64
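
The hour with the highest total can also be read off programmatically with `idxmax()` instead of by eye. A minimal sketch on made-up sales lines; the `sales` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical sales lines: hour of day and line total.
sales = pd.DataFrame({
    "time": [9, 9, 12, 12, 15],
    "sum": [10.0, 5.0, 20.0, 7.5, 12.0],
})

# Total revenue per hour, then the hour with the largest total.
revenue_by_hour = sales.groupby("time")["sum"].sum()
best_hour = revenue_by_hour.idxmax()
print(best_hour, revenue_by_hour.max())
```

Note that a total per hour reflects both the number of transactions and their size; replacing `.sum()` with `.mean()` would compare the average line value per hour instead.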

In [25]:
df_neg = pd.read_csv("data/negative.xlsx")
df_neg = df_neg.drop(["Description", "InvoiceDate", "CustomerID"], axis=1)
df_neg

Unnamed: 0,Quantity,UnitPrice,sum,time,Australia,Austria,Bahrain,Belgium,Channel Islands,Cyprus,Czech Republic,Denmark,EIRE,European Community,Finland,France,Germany,Greece,Hong Kong,Israel,Italy,Japan,Malta,Netherlands,Norway,Poland,Portugal,Saudi Arabia,Singapore,Spain,Sweden,Switzerland,USA,United Kingdom
0,-1,27.50,-27.50,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,-1,4.65,-4.65,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,-12,1.65,-19.80,10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,-24,0.29,-6.96,10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
4,-24,0.29,-6.96,10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10619,-11,0.83,-9.13,9,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
10620,-1,224.69,-224.69,10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
10621,-5,10.95,-54.75,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
10622,-1,1.25,-1.25,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1


In [26]:
from sklearn.linear_model import LinearRegression
y = df_neg["sum"]
x = df_neg.drop(["sum"], axis=1)
regression_neg = LinearRegression().fit(x,y)

In [27]:
for i in range(len(x.columns)):
    coef = regression_neg.coef_[i]
    print(f"{x.columns[i]}: {coef}")
    # the larger the coefficient, the more likely the customer is to return the product

Quantity: 1.5333803979997265
UnitPrice: -1.0073450555839025
time: 1.1485974618550663
Australia: 6.789503317935911
Austria: 23.62803346536054
Bahrain: -109.31281403697189
Belgium: 7.783381196253154
Channel Islands: 7.486491621607564
Cyprus: -11.466914514312846
Czech Republic: 10.362827988584293
Denmark: 3.1992099436240578
EIRE: -2.0746435430901573
European Community: 10.931832014795502
Finland: 8.947699870114015
France: -16.599614989466804
Germany: 7.58340808665523
Greece: 8.241500600640299
Hong Kong: 19.54672516581085
Israel: -60.38488085824661
Italy: 4.929470695924252
Japan: -10.875976326310175
Malta: 5.80710797610829
Netherlands: 98.72562708240777
Norway: 4.712541819661059
Poland: 4.15997200466928
Portugal: 9.646814806299949
Saudi Arabia: 5.675229714129914
Singapore: 22.437443310267106
Spain: -33.13310060731816
Sweden: -83.46143666550923
Switzerland: 6.14566459711596
USA: 9.78700126533073
United Kingdom: 40.78189499772973


### classification
------------

In [102]:
# from sklearn.cluster import KMeans
# from sklearn.metrics import silhouette_score
# kmeans = KMeans().fit(X_train)
# kmeans.predict(X_test)
# centers = kmeans.cluster_centers_
# centers

#### imbalanced data

In [103]:
# !pip install -U scikit-learn
# !pip install -U imbalanced-learn
# Run these if the packages are not installed yet. Restart the Jupyter kernel after installing if imports fail.

In [104]:
# imbalanced data:
#https://www.analyticsvidhya.com/blog/2020/07/10-techniques-to-deal-with-class-imbalance-in-machine-learning/
## random under sampler
from imblearn.under_sampling import RandomUnderSampler
sampler = RandomUnderSampler()

## random over sampler
# from imblearn.over_sampling import RandomOverSampler
# sampler = RandomOverSampler()

## SMOTE
# from imblearn.over_sampling import SMOTE
# sampler = SMOTE()

## TomekLinks
# from imblearn.under_sampling import TomekLinks
# sampler = TomekLinks()

## NearMiss
# from imblearn.under_sampling import NearMiss
# sampler = NearMiss()

## sampling
# X_train, y_train = sampler.fit_resample(X_train, y_train)

# Resampling is not needed here: k-means is unsupervised, and the regression does not predict categorical data.
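
For reference, the idea behind random under-sampling can be sketched with plain pandas: every class is sampled down to the size of the smallest class. This is an illustration of the concept on made-up data, not the implementation used by `imblearn`; the `labeled` frame is hypothetical.

```python
import pandas as pd

# Hypothetical imbalanced data: 8 rows of class "a", 2 rows of class "b".
labeled = pd.DataFrame({
    "feature": range(10),
    "label": ["a"] * 8 + ["b"] * 2,
})

# Size of the smallest class.
minority_size = labeled["label"].value_counts().min()

# Draw the same number of rows from every class, without replacement.
balanced = labeled.groupby("label").sample(n=minority_size, random_state=0)
print(balanced["label"].value_counts())
```

As in the commented `sampler.fit_resample` call above, resampling should be applied to the training set only, so that the test set keeps the original class distribution.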

#### modeling

In [105]:
# # SVM
# from sklearn.svm import SVC
# model = SVC(kernel='linear', random_state=None)
# model.fit(X_train, y_train)
# y_pred = model.predict(X_test)

In [106]:
# # LightGBM
# !pip install optuna
# import optuna.integration.lightgbm as lgb
# lgb_train = lgb.Dataset(X_train, y_train)
# lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
# params = {
#     "objective" : "multiclass",
#     "metric" : "multi_logloss",
#     "num_class" : len(y.unique())
# }
# model = lgb.train(params, lgb_train, valid_sets=lgb_eval)
# y_prob = model.predict(X_test, num_iteration=model.best_iteration)
# y_pred = np.argmax(y_prob, axis=1)

## assess model
-------------

### task

The data mining engineer interprets the models according to his domain knowledge, the data mining success criteria and the desired test design. This task interferes with the subsequent evaluation phase. Whereas the data mining engineer judges the success of the application of modeling and discovery techniques more technically, he contacts business analysts and domain experts later in order to discuss the data mining results in the business context. Moreover, this task only considers models whereas the evaluation phase also takes into account all other results that were produced in the course of the project. The data mining engineer tries to rank the models. He assesses the models according to the evaluation criteria. As far as possible he also takes into account business objectives and business success criteria. In most data mining projects, the data mining engineer applies a single technique more than once or generates data mining results with different alternative techniques. In this task, he also compares all results according to the evaluation criteria.

### output

#### model assessment

Summarize results of this task, list qualities of generated models (e.g., in terms of accuracy) and rank their quality in relation to each other.
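
Ranking models requires scoring them on the same held-out data with the same metric. The sketch below compares a linear regression against a mean-predicting baseline on synthetic data; the candidate set, the data and all names are assumptions made for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical data with a linear signal plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

candidates = {
    "mean baseline": DummyRegressor(strategy="mean"),
    "linear regression": LinearRegression(),
}
# Fit each candidate on the training split and score it on the test split.
scores = {
    name: mean_squared_error(y_te, est.fit(X_tr, y_tr).predict(X_te))
    for name, est in candidates.items()
}
# Lower MSE is better, so sort ascending.
ranking = sorted(scores, key=scores.get)
for name in ranking:
    print(f"{name}: MSE = {scores[name]:.4f}")
```

A baseline row in the ranking makes an absolute number such as an MSE easier to judge, because it shows how much the model improves on always predicting the mean.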

#### revised parameter settings

According to the model assessment, revise parameter settings and tune them for the next run in the Build Model task. Iterate model building and assessment until you strongly believe that you found the best model(s). Document all such revisions and assessments.
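
The build/assess iteration can be automated with a cross-validated grid search. The sketch below uses `Ridge` only because it has a regularization strength `alpha` to tune, unlike plain `LinearRegression`; the data, the grid values and the names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Hypothetical data with a known linear relationship plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=120)

# Try each alpha with 5-fold cross-validation; scores are negated MSE.
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_demo, y_demo)

# Best setting and its cross-validated MSE (sign flipped back to positive).
print(search.best_params_, -search.best_score_)
```

`search.cv_results_` holds the score of every setting tried, which can serve as the record of revisions that this task asks for.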


### regression
-----------------

In [107]:
from sklearn.metrics import mean_squared_error
y_pred = regression.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
count = X_test.shape[0]
print(f'for {count} rows mse is {mse}')

for 103056 rows mse is 1769.347151203462
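
MSE is expressed in squared units of the target, so it is often reported together with its square root (RMSE) and with R². A small worked example on made-up numbers; the arrays are hypothetical and unrelated to the notebook's data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true values and predictions.
y_true_demo = np.array([10.0, 20.0, 30.0, 40.0])
y_pred_demo = np.array([12.0, 18.0, 33.0, 37.0])

# Squared errors are 4, 4, 9, 9, so the mean is 6.5.
mse_demo = mean_squared_error(y_true_demo, y_pred_demo)
# RMSE is in the same units as the target.
rmse_demo = np.sqrt(mse_demo)
# R^2 = 1 - SS_res / SS_tot = 1 - 26 / 500.
r2_demo = r2_score(y_true_demo, y_pred_demo)
print(mse_demo, rmse_demo, r2_demo)
```

MSE is already an average over rows, so it should not be divided by the row count again; taking the square root is the step that gives a typical error size.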


### classification
----------------


In [94]:
# # accuracy
# from sklearn.metrics import accuracy_score
# accuracy = accuracy_score(y_test, y_pred)
# print(f'{accuracy * 100:.1f}%')

In [95]:
# # confusion matrix
# from sklearn.metrics import confusion_matrix
# cm = confusion_matrix(y_test, y_pred)
# print(cm)
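
How accuracy relates to the confusion matrix can be seen on a tiny example. The labels below are made up for illustration.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical binary labels and predictions.
y_true_demo = [0, 0, 1, 1, 1]
y_pred_demo = [0, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
print(cm_demo)

# Accuracy is the diagonal (correct predictions) divided by the total count.
acc_demo = accuracy_score(y_true_demo, y_pred_demo)
print(f"{acc_demo * 100:.1f}%")
```

Here three of five predictions are correct: one true negative and two true positives on the diagonal, with one false positive and one false negative off it.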

#### binary classification

#### multi-class classification

## note/questions
-------------

#### select modeling technique
Modeling technique: linear regression. \
Modeling assumptions: no missing values allowed; all attributes have uniform distributions.
#### generate test design
The dataset is split into a 70% training set and a 30% test set.
#### build model
Regression: the model is built to estimate profit from the customer's country, the product's unit price and quantity, and the time of purchase. It is meant to identify which products (by country, price and quantity) are best suited for profit, in which countries we should sell, which time is best for selling, and whether we should increase the quantity of the same products.
A second model looks at the chances of a product being returned, based on characteristics such as country, time, quantity and price.
#### assess model
For the test set (103056 rows) the MSE was around 1770. \
MSE is already an average over rows, so the typical error is its square root (RMSE), about 42.06 in the units of `sum`. \
Linear regression has no hyperparameters to tune here, so no tuning was needed.
The second model does not need to be assessed: it is not used for prediction, only to relate the chances of a product being returned to the characteristics in this data.