# DMC 2022
### Predicting user-based replenishment of a product based on historical orders and item features 

## 1. Task

The participating teams’ goal is to predict the user-based replenishment of a product based on
historical orders and item features. Individual items and user specific orders are given for the period
between 01.06.2020 and 31.01.2021. The prediction period is between 01.02.2021 and 28.02.2021,
which is exactly four weeks long.
For a predefined subset of user and product combinations, the participants shall predict if and when
a product will be purchased during the prediction period.
The prediction column in the “submission.csv” file must be filled accordingly.
* 0 - no replenishment during that period
* 1 - replenishment in the first week
* 2 - replenishment in the second week
* 3 - replenishment in the third week
* 4 - replenishment in the fourth week
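
The label scheme above can be written as a small helper. This is a minimal sketch, assuming week 1 starts on 01.02.2021 and every week spans seven consecutive days; the name `week_label` is hypothetical and not part of the task material.

```python
from datetime import date
from typing import Optional

PERIOD_START = date(2021, 2, 1)   # first day of the prediction period
PERIOD_END = date(2021, 2, 28)    # last day of the prediction period


def week_label(purchase: Optional[date]) -> int:
    """Return 0 for no purchase in the period, otherwise the week number 1-4."""
    if purchase is None or not (PERIOD_START <= purchase <= PERIOD_END):
        return 0
    # Days 0-6 fall into week 1, days 7-13 into week 2, and so on.
    return (purchase - PERIOD_START).days // 7 + 1


print(week_label(date(2021, 2, 1)))   # first day of the period
print(week_label(date(2021, 2, 14)))  # last day of the second week
print(week_label(None))               # no replenishment
```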

## 2. Problem Definition

The problem we will be exploring is **multiclass classification**. Based on a number of different features, we predict one of five classes for each user and product combination: replenishment in week 1, 2, 3 or 4 of the prediction period, or no replenishment at all (class 0).

## 3. Tools we are going to use

* [pandas](https://pandas.pydata.org/) for data analysis and data manipulation
* [Knime](https://www.knime.com/) for data analysis (outside of this notebook)
* [NumPy](https://numpy.org/) for numerical operations
* [Matplotlib](https://matplotlib.org/) for visualization
* [Scikit-Learn](https://scikit-learn.org/stable/) for machine learning modeling and evaluation
* [XGBoost](https://xgboost.readthedocs.io/en/stable/) for gradient boosting
* [Hyperopt](http://hyperopt.github.io/hyperopt/) for hyper-parameter optimization

## 4. Features

1. date
2. userID
3. itemID
4. order
5. brand
6. feature_1
7. feature_2
8. feature_3
9. feature_4
10. feature_5
11. categories
12. week

#### Not used
13. RCP
14. parent_category

## Imports and Functions

In [52]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import scipy as sc
import gc

import xgboost as xgb
from xgboost import XGBClassifier

from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.tree import DecisionTreeClassifier

import hyperopt
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe

def show_mem_usage(df):
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
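
For object columns such as `date` and `categories`, `DataFrame.memory_usage()` by default does not count the memory used by the Python objects themselves, so the helper above may understate the real footprint. The following is a sketch of a variant that passes `deep=True`; the name `mem_usage_mb` and the tiny demo frame are illustrative assumptions.

```python
import pandas as pd


def mem_usage_mb(df: pd.DataFrame, deep: bool = True) -> float:
    """Return the memory footprint of df in MB (1 MB = 1024**2 bytes)."""
    return df.memory_usage(deep=deep).sum() / 1024**2


# Tiny illustrative frame with a string column.
df_demo = pd.DataFrame({
    "itemID": [3477, 30474],
    "categories": ["[74, 4109, 3867, 803, 4053]", "[3459, 3738, 679]"],
})
shallow = mem_usage_mb(df_demo, deep=False)
deep = mem_usage_mb(df_demo, deep=True)
print(f"shallow: {shallow:.6f} MB, deep: {deep:.6f} MB")
```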

## Read data

In [53]:
file1 = r'E:\OneDrive\Arbeit\Repos\DMC2022\Kevin\csv\08_datasets_monthly_split_w0_to_nxt_month_labeled\220613_07_dataset_w0-to-nxt-month_labeled_dec.csv'
file2 = r'E:\OneDrive\Arbeit\Repos\DMC2022\Kevin\csv\08_datasets_monthly_split_w0_to_nxt_month_labeled\220613_08_dataset_w0-to-nxt-month_labeled_jan.csv'
#file1 = r'E:\OneDrive\Arbeit\Repos\DMC2022\Kevin\csv\04_complete_dataset_labeled_wLastPurchaseDates_noOnetimers.csv'
df_data1 = pd.read_csv(file1, sep='|', dtype={'userID':np.uint32,
                                            'date':str, 
                                            'itemID':np.uint32,
                                            'order':np.uint8,
                                            'brand':np.int16,
                                            'feature_1':np.int8,
                                            'feature_2':np.uint8,
                                            'feature_3':np.int16,
                                            'feature_4':np.int8,
                                            'feature_5':np.int16,
                                            'week':np.uint8})
                     #chunksize=10000)
    
df_data2 = pd.read_csv(file2, sep='|', dtype={'userID':np.uint32,
                                            'date':str, 
                                            'itemID':np.uint32,
                                            'order':np.uint8,
                                            'brand':np.uint16,
                                            'feature_1':np.int8,
                                            'feature_2':np.uint8,
                                            'feature_3':np.int16,
                                            'feature_4':np.int8,
                                            'feature_5':np.int16,
                                            'week':np.uint8})

In [54]:
df_data1.drop('lastPurchaseDate', axis=1, inplace=True)
df_data1.drop('purchaseDates', axis=1, inplace=True)
df_data2.drop('lastPurchaseDate', axis=1, inplace=True)
df_data2.drop('purchaseDates', axis=1, inplace=True)
df_data1.head(10)

Unnamed: 0,date,userID,itemID,order,brand,feature_1,feature_2,feature_3,feature_4,feature_5,categories,week
0,2020-06-01,38769,3477,1,186,6,0,196,0,45,"[74, 4109, 3867, 803, 4053]",0
1,2020-06-01,42535,30474,1,193,10,3,229,3,132,"[3459, 3738, 679, 1628, 4072]",0
2,2020-06-01,42535,15833,1,1318,4,1,455,0,108,"[2973, 2907, 2749, 3357]",0
3,2020-06-01,42535,20131,1,347,4,0,291,3,44,"[30, 1515, 1760, 2932, 1287, 2615, 3727, 2450,...",0
4,2020-06-01,42535,4325,1,539,6,0,303,0,45,"[3104, 1772, 2029, 1274, 3915, 888, 1118, 3882...",0
5,2020-06-01,42535,12919,1,1338,10,0,26,0,39,"[813, 3949, 3961]",0
6,2020-06-01,29737,9139,1,703,10,0,413,3,3,"[626, 1995, 2896, 1605, 564, 3510, 1389, 2112,...",0
7,2020-06-01,29737,11535,3,328,4,0,498,3,13,"[715, 3267]",0
8,2020-06-01,43683,18733,1,1496,4,0,17,0,81,"[545, 1032, 3963]",0
9,2020-06-01,42535,15005,1,361,10,0,505,0,152,"[568, 1085, 2810, 2664, 3914, 3915]",0


In [4]:
df_data2.head(10)

Unnamed: 0,date,userID,itemID,order,brand,feature_1,feature_2,feature_3,feature_4,feature_5,categories,week
0,2020-06-01,38769,3477,1,186,6,0,196,0,45,"[74, 4109, 3867, 803, 4053]",0
1,2020-06-01,42535,30474,1,193,10,3,229,3,132,"[3459, 3738, 679, 1628, 4072]",0
2,2020-06-01,42535,15833,1,1318,4,1,455,0,108,"[2973, 2907, 2749, 3357]",0
3,2020-06-01,42535,20131,1,347,4,0,291,3,44,"[30, 1515, 1760, 2932, 1287, 2615, 3727, 2450,...",0
4,2020-06-01,42535,4325,1,539,6,0,303,0,45,"[3104, 1772, 2029, 1274, 3915, 888, 1118, 3882...",0
5,2020-06-01,42535,12919,1,1338,10,0,26,0,39,"[813, 3949, 3961]",0
6,2020-06-01,29737,9139,1,703,10,0,413,3,3,"[626, 1995, 2896, 1605, 564, 3510, 1389, 2112,...",0
7,2020-06-01,29737,11535,3,328,4,0,498,3,13,"[715, 3267]",0
8,2020-06-01,43683,18733,1,1496,4,0,17,0,81,"[545, 1032, 3963]",0
9,2020-06-01,42535,15005,1,361,10,0,505,0,152,"[568, 1085, 2810, 2664, 3914, 3915]",0


# Preprocessing

In [55]:
df_data1 = df_data1.sort_values('date')
df_data2 = df_data2.sort_values('date')

In [6]:
df_data1.tail()

Unnamed: 0,date,userID,itemID,order,brand,feature_1,feature_2,feature_3,feature_4,feature_5,categories,week
776709,2020-12-31,41002,13027,2,186,4,0,319,3,16,"[30, 1070, 1626, 377, 1060, 3268, 2104, 3915, ...",4
776708,2020-12-31,45327,16680,1,888,4,0,76,0,53,"[1158, 777, 855, 480, 2890, 1390, 3915, 3281, ...",4
776707,2020-12-31,45327,25993,1,186,10,0,27,3,38,"[545, 855, 813, 3444, 1763, 3924, 3915, 3912, ...",4
776721,2020-12-31,36113,23187,2,199,10,3,321,3,127,"[2389, 3485, 194, 2574, 358, 990, 1502, 3140, ...",4
778134,2020-12-31,27030,31073,1,1126,4,0,291,3,129,"[777, 30, 1763, 3727, 285, 3499, 3284, 3924]",4


### Multi-Hot-Encoding for categories

One-Hot-Encoding assumes that each cell holds a single value, which is converted to a one in the matching column. Multi-Hot-Encoding handles cells that hold several values: each value is converted to a one in its own column. Our `categories` column stores these values as a string, so we first have to parse that string into a list of integers before we can convert it into columns without creating duplicates.
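
To make the idea concrete, here is a minimal pure-Python sketch of parsing one `categories` cell and turning it into a multi-hot vector. The helper names `parse_categories` and `multi_hot` are hypothetical, and the sketch assumes the cell is a complete, well-formed list literal.

```python
import ast


def parse_categories(cell: str) -> list[int]:
    """Turn a string such as "[74, 4109]" into a sorted, de-duplicated list of ints."""
    values = ast.literal_eval(cell)
    return sorted(set(int(v) for v in values))


def multi_hot(labels: list[int], n_classes: int) -> list[int]:
    """Return a 0/1 vector with a one at every index listed in labels."""
    row = [0] * n_classes
    for label in labels:
        row[label] = 1
    return row


cats = parse_categories("[2, 5, 2]")
print(cats)
print(multi_hot(cats, n_classes=8))
```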

#### Memory problem after Multi-Hot-Encoding
The problem we face when Multi-Hot-Encoding our categories is the following: after preprocessing and encoding we have 3,040,458,033 data points (904,091 rows × 3,363 columns). When we encode the categories with the `str.get_dummies()` method, the resulting dataframe takes roughly 30 GB, depending on how many rows and features we use. With a dataframe this big we run into memory problems when processing the data and building the model.
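
A back-of-the-envelope check of the dense size, using the row and column counts quoted above. The bytes-per-cell values are assumptions about the dtype (8 bytes for an `int64` cell, 1 byte for `int8`), so the result only indicates the order of magnitude.

```python
rows, cols = 904_091, 3_363
cells = rows * cols  # total number of data points in the dense frame


def dense_size_gib(n_cells: int, bytes_per_cell: int) -> float:
    """Size of a dense array in GiB for a given cell width."""
    return n_cells * bytes_per_cell / 1024**3


for name, width in [("int8", 1), ("int64", 8)]:
    print(f"{name}: {dense_size_gib(cells, width):.1f} GiB")
```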

#### Solution
There are a couple of different ways to work around this problem. Normally we could work around memory limitations using batch processing or external memory. For the DMC dataset this is not optimal, since we need the whole customer history to make accurate predictions.

Since most of the columns we create with Multi-Hot-Encoding will be filled with zeros, we use a sparse matrix to significantly reduce the size of the resulting dataframe. With this approach the dataframe shrinks to 113 MB instead of ~30 GB.
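
The effect can be reproduced on a small synthetic example. This sketch uses a made-up 1,000 × 50 matrix with one non-zero entry per row, not the DMC data, and compares the dense and sparse memory footprint reported by pandas.

```python
import numpy as np
import pandas as pd

n_rows, n_cols = 1_000, 50
dense = np.zeros((n_rows, n_cols), dtype=np.uint8)
# Set exactly one entry per row, so 2% of all cells are non-zero.
dense[np.arange(n_rows), np.arange(n_rows) % n_cols] = 1

df_dense = pd.DataFrame(dense)
# Store only the non-zero entries; zeros become the implicit fill value.
df_sparse = df_dense.astype(pd.SparseDtype(np.uint8, 0))

dense_bytes = df_dense.memory_usage().sum()
sparse_bytes = df_sparse.memory_usage().sum()
print(f"dense:   {dense_bytes} bytes")
print(f"sparse:  {sparse_bytes} bytes")
print(f"density: {df_sparse.sparse.density:.2%}")
```

The saving grows with the share of zeros, which is why it pays off for a category matrix where each row sets only a handful of several thousand columns.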

In [57]:
# Append a dummy row that contains every category ID (0-4299), so that both
# dataframes get the same columns after Multi-Hot-Encoding
ls = list(range(4300))
df_tmp = pd.DataFrame({'date': ['1'],
                   'userID': [1],
                   'itemID': [1],
                   'order': [1],
                   'brand': [1],
                   'feature_1': [1],
                   'feature_2': [1],
                   'feature_3': [1],
                   'feature_4': [1],
                   'feature_5': [1],
                   'categories': str(ls),
                   'week': [-1]})
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df_data1 = pd.concat([df_data1, df_tmp], ignore_index=True)
df_data2 = pd.concat([df_data2, df_tmp], ignore_index=True)

In [8]:
# Convert strings to lists of integers in 'categories'
df_cat1 = df_data1
df_cat2 = df_data2

df_cat1["categories"] = df_cat1["categories"].apply(lambda x: [int(i) for i in x[1:-1].split(',')])
df_cat2["categories"] = df_cat2["categories"].apply(lambda x: [int(i) for i in x[1:-1].split(',')])

In [9]:
# Multi-Hot-Encode 'categories' (dense output), then convert the result to a sparse dtype
c = df_cat1["categories"]
mlb = MultiLabelBinarizer(sparse_output=False) # Set to True if output binary array is desired in CSR sparse format
df_multi_hot1 = pd.DataFrame(mlb.fit_transform(c), columns=mlb.classes_, index=None, dtype=np.int8).astype(pd.SparseDtype(np.uint8, 0))

c = df_cat2["categories"]
mlb = MultiLabelBinarizer(sparse_output=False) # Set to True if output binary array is desired in CSR sparse format
df_multi_hot2 = pd.DataFrame(mlb.fit_transform(c), columns=mlb.classes_, index=None, dtype=np.int8).astype(pd.SparseDtype(np.uint8, 0))

show_mem_usage(df_multi_hot1), show_mem_usage(df_multi_hot2)

Memory usage of dataframe is 3190.98 MB
Memory usage of dataframe is 3698.96 MB


(None, None)

In [11]:
del df_multi_hot1
del df_multi_hot2
gc.collect()

0
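
The cells shown here do not include the construction of sparse frames such as `sparse_df_mh1`. One way such a frame could be built is sketched below, under stated assumptions: category IDs lie in 0..4299 (as in the dummy row), and the helper name `multi_hot_sparse` is hypothetical. Passing `classes` to `MultiLabelBinarizer` fixes the column set up front, and `sparse_output=True` avoids materializing the dense matrix.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Assumption: category IDs lie in 0..4299, as in the dummy row.
N_CATEGORIES = 4300


def multi_hot_sparse(categories: pd.Series) -> pd.DataFrame:
    """Encode a Series of integer lists as a sparse 0/1 DataFrame."""
    mlb = MultiLabelBinarizer(classes=list(range(N_CATEGORIES)), sparse_output=True)
    matrix = mlb.fit_transform(categories)  # SciPy sparse matrix, never densified
    return pd.DataFrame.sparse.from_spmatrix(
        matrix, index=categories.index, columns=mlb.classes_
    )


demo = pd.Series([[74, 803], [813, 3949, 3961]])
encoded = multi_hot_sparse(demo)
print(encoded.shape)
print(int(encoded[803].sum()))
```

Because the result keeps the index of the input Series, it can be joined back onto the original dataframe by index.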

In [12]:
df_cat1.head()

Unnamed: 0,date,userID,itemID,order,brand,feature_1,feature_2,feature_3,feature_4,feature_5,categories,week
0,2020-06-01,38769,3477,1,186,6,0,196,0,45,"[74, 4109, 3867, 803, 4053]",0
1,2020-06-01,19039,26896,1,378,10,0,421,0,3,"[3224, 2580, 903, 2690]",0
2,2020-06-01,19039,7235,1,1508,4,1,458,0,65535,"[1114, 478]",0
3,2020-06-01,19039,29622,1,1276,4,0,27,3,66,"[813, 480, 1390, 3999]",0
4,2020-06-01,19039,21556,1,1201,4,0,29,0,176,"[813, 1680, 3915, 3999, 3949, 4039, 4069]",0


In [13]:
%%time

# Combine df_data and sparse_df_mh
df_combined1 = df_cat1.join(sparse_df_mh1, how='inner')
df_combined2 = df_cat2.join(sparse_df_mh2, how='inner')
show_mem_usage(df_combined1), show_mem_usage(df_combined2)
df_combined1.head()

Memory usage of dataframe is 127.36 MB
Memory usage of dataframe is 146.95 MB
CPU times: total: 281 ms
Wall time: 286 ms


Unnamed: 0,date,userID,itemID,order,brand,feature_1,feature_2,feature_3,feature_4,feature_5,...,4290,4291,4292,4293,4294,4295,4296,4297,4298,4299
0,2020-06-01,38769,3477,1,186,6,0,196,0,45,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2020-06-01,19039,26896,1,378,10,0,421,0,3,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2020-06-01,19039,7235,1,1508,4,1,458,0,65535,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2020-06-01,19039,29622,1,1276,4,0,27,3,66,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2020-06-01,19039,21556,1,1201,4,0,29,0,176,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [14]:
# pop and append 'week' at end of dataframe
col = df_combined1.pop("week")
df_combined1.insert(len(df_combined1.columns), col.name, col)

col = df_combined2.pop("week")
df_combined2.insert(len(df_combined2.columns), col.name, col)

df_combined1.head()

Unnamed: 0,date,userID,itemID,order,brand,feature_1,feature_2,feature_3,feature_4,feature_5,...,4291,4292,4293,4294,4295,4296,4297,4298,4299,week
0,2020-06-01,38769,3477,1,186,6,0,196,0,45,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
1,2020-06-01,19039,26896,1,378,10,0,421,0,3,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
2,2020-06-01,19039,7235,1,1508,4,1,458,0,65535,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
3,2020-06-01,19039,29622,1,1276,4,0,27,3,66,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
4,2020-06-01,19039,21556,1,1201,4,0,29,0,176,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0


In [15]:
# Check if we have any missing values
df_combined1[df_combined1.isnull().any(axis=1)]

Unnamed: 0,date,userID,itemID,order,brand,feature_1,feature_2,feature_3,feature_4,feature_5,...,4291,4292,4293,4294,4295,4296,4297,4298,4299,week


In [16]:
# Check if we have any missing values
df_combined2[df_combined2.isnull().any(axis=1)]

Unnamed: 0,date,userID,itemID,order,brand,feature_1,feature_2,feature_3,feature_4,feature_5,...,4291,4292,4293,4294,4295,4296,4297,4298,4299,week


In [17]:
df_combined1.drop('categories', axis=1, inplace=True)
df_combined2.drop('categories', axis=1, inplace=True)
show_mem_usage(df_combined1), show_mem_usage(df_combined2)

Memory usage of dataframe is 121.42 MB
Memory usage of dataframe is 140.06 MB


(None, None)

In [18]:
df_combined1.tail()

Unnamed: 0,date,userID,itemID,order,brand,feature_1,feature_2,feature_3,feature_4,feature_5,...,4291,4292,4293,4294,4295,4296,4297,4298,4299,week
778131,2020-12-31,45327,16680,1,888,4,0,76,0,53,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4
778132,2020-12-31,45327,25993,1,186,10,0,27,3,38,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4
778133,2020-12-31,36113,23187,2,199,10,3,321,3,127,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4
778134,2020-12-31,27030,31073,1,1126,4,0,291,3,129,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4
778135,1,1,1,1,1,1,1,1,1,1,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,-1


In [19]:
# Drop the dummy row that was appended before Multi-Hot-Encoding
df_combined1 = df_combined1.iloc[:-1 , :]
df_combined2 = df_combined2.iloc[:-1 , :]

In [20]:
df_combined2.tail()

Unnamed: 0,date,userID,itemID,order,brand,feature_1,feature_2,feature_3,feature_4,feature_5,...,4291,4292,4293,4294,4295,4296,4297,4298,4299,week
902003,2021-01-31,38259,23411,1,1355,4,0,489,3,66,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4
902004,2021-01-31,10236,6654,1,1496,10,0,359,0,97,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4
902005,2021-01-31,21521,29277,1,127,10,0,519,0,8,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4
902006,2021-01-31,43456,11639,1,926,6,0,497,0,13,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4
902007,2021-01-31,25974,17983,1,615,4,0,486,3,106,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4


# Model

### Splitting Training- / Testdata

In [21]:
df1 = df_combined1.copy()
df2 = df_combined2.copy()
id(df1), id(df_combined1), id(df2), id(df_combined2)

(2307846648592, 2309341828080, 2307846649168, 2309251570496)

In [22]:
#df1.sort_values('date')
#df2.sort_values('date')

In [23]:
df1.head(), df2.head()

(         date  userID  itemID  order  brand  feature_1  feature_2  feature_3  \
 0  2020-06-01   38769    3477      1    186          6          0        196   
 1  2020-06-01   19039   26896      1    378         10          0        421   
 2  2020-06-01   19039    7235      1   1508          4          1        458   
 3  2020-06-01   19039   29622      1   1276          4          0         27   
 4  2020-06-01   19039   21556      1   1201          4          0         29   
 
    feature_4  feature_5  ...  4291  4292  4293  4294  4295  4296  4297  4298  \
 0          0         45  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
 1          0          3  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
 2          0      65535  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
 3          3         66  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
 4          0        176  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
 
    4299  week  
 0   0.

In [24]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 778135 entries, 0 to 778134
Columns: 4311 entries, date to week
dtypes: Sparse[float64, 0](4300), int64(10), object(1)
memory usage: 121.4+ MB


In [25]:
# drop date
df1.drop('date', axis=1, inplace=True)
df2.drop('date', axis=1, inplace=True)

In [26]:
# Split training/test data
# train = jun-dec20 / test = jan21

X_train = df1.iloc[:, 0:-1]
X_test = df2.iloc[:, 0:-1]
y_train = df1.iloc[:,-1]
y_test = df2.iloc[:,-1]
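
Because the training and test frames are encoded separately, it is worth verifying that both expose the same feature columns in the same order before fitting. The following is a sketch on tiny stand-in frames; the helper `split_features_target` is hypothetical.

```python
import pandas as pd


def split_features_target(df: pd.DataFrame, target: str = "week"):
    """Return (X, y) where y is the target column and X is everything else."""
    return df.drop(columns=[target]), df[target]


# Tiny stand-ins for the two monthly frames.
train_df = pd.DataFrame({"userID": [1, 2], "itemID": [10, 20], "week": [0, 3]})
test_df = pd.DataFrame({"userID": [3], "itemID": [30], "week": [1]})

X_tr, y_tr = split_features_target(train_df)
X_te, y_te = split_features_target(test_df)

# Both frames must expose identical feature columns in identical order.
assert list(X_tr.columns) == list(X_te.columns)
print(X_tr.shape, X_te.shape)
```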

In [27]:
X_train

Unnamed: 0,userID,itemID,order,brand,feature_1,feature_2,feature_3,feature_4,feature_5,0,...,4290,4291,4292,4293,4294,4295,4296,4297,4298,4299
0,38769,3477,1,186,6,0,196,0,45,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,19039,26896,1,378,10,0,421,0,3,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,19039,7235,1,1508,4,1,458,0,65535,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,19039,29622,1,1276,4,0,27,3,66,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,19039,21556,1,1201,4,0,29,0,176,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
778130,41002,13027,2,186,4,0,319,3,16,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
778131,45327,16680,1,888,4,0,76,0,53,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
778132,45327,25993,1,186,10,0,27,3,38,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
778133,36113,23187,2,199,10,3,321,3,127,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [28]:
type(X_train)

pandas.core.frame.DataFrame

In [29]:
show_mem_usage(X_train), show_mem_usage(X_test)

Memory usage of dataframe is 109.50 MB
Memory usage of dataframe is 126.25 MB


(None, None)

In [31]:
X_test

Unnamed: 0,userID,itemID,order,brand,feature_1,feature_2,feature_3,feature_4,feature_5,0,...,4290,4291,4292,4293,4294,4295,4296,4297,4298,4299
0,38769,3477,1,186,6,0,196,0,45,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,19039,29622,1,1276,4,0,27,3,66,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,19039,21556,1,1201,4,0,29,0,176,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,17085,14684,1,1194,10,1,503,0,17,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,17085,19969,3,1194,10,0,503,0,85,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
902003,38259,23411,1,1355,4,0,489,3,66,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
902004,10236,6654,1,1496,10,0,359,0,97,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
902005,21521,29277,1,127,10,0,519,0,8,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
902006,43456,11639,1,926,6,0,497,0,13,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [32]:
y_train

0         0
1         0
2         0
3         0
4         0
         ..
778130    4
778131    4
778132    4
778133    4
778134    4
Name: week, Length: 778135, dtype: int64

In [33]:
y_test

0         0
1         0
2         0
3         0
4         0
         ..
902003    4
902004    4
902005    4
902006    4
902007    4
Name: week, Length: 902008, dtype: int64

In [34]:
# Split training and test data
# stratify=y preserves the class proportions of the target in both the train and the test set
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, stratify=y)

#show_mem_usage(X_train), show_mem_usage(X_test)

# DecisionTreeClassifier

In [35]:
"""
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_train = le.fit_transform(y_train)
"""

'\nfrom sklearn.preprocessing import LabelEncoder\nle = LabelEncoder()\ny_train = le.fit_transform(y_train)\n'

In [36]:
"""%%time

classifier = DecisionTreeClassifier()
classifier = classifier.fit(X_train,y_train)
"""

'%%time\n\nclassifier = DecisionTreeClassifier()\nclassifier = classifier.fit(X_train,y_train)\n'

In [37]:
"""y_train_pred = classifier.predict(X_train)
y_test_pred = classifier.predict(X_test)

dct_train = accuracy_score(y_train, y_train_pred)
dct_test = accuracy_score(y_test, y_test_pred)
print()
print(f'Decision Tree train/test accuracies: '
     f'{dct_train:.3f}/{dct_test:.3f}')
"""

"y_train_pred = classifier.predict(X_train)\ny_test_pred = classifier.predict(X_test)\n\ndct_train = accuracy_score(y_train, y_train_pred)\ndct_test = accuracy_score(y_test, y_test_pred)\nprint()\nprint(f'Decision Tree train/test accuracies: '\n     f'{dct_train:.3f}/{dct_test:.3f}')\n"

In [38]:
%%time

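# XGBoost model with manually chosen hyperparameters
# Note: tree_method='gpu_hist' and gpu_id require a CUDA GPU; XGBoost >= 2.0 deprecates them in favor of tree_method='hist', device='cuda'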
model1 = XGBClassifier(tree_method='gpu_hist', gpu_id=0,
                    n_estimators = 15, max_depth = 3, gamma = 2,
                    reg_alpha = 50, reg_lambda = 0.5, min_child_weight=5,
                    colsample_bytree=0.5)
gbm = model1.fit(X_train, y_train)

y_train_pred = gbm.predict(X_train)
y_test_pred = gbm.predict(X_test)

xgb_train = accuracy_score(y_train, y_train_pred)
xgb_test = accuracy_score(y_test, y_test_pred)
print()
print(f'XGboost train/test accuracies: '
     f'{xgb_train:.3f}/{xgb_test:.3f}')


XGboost train/test accuracies: 0.822/0.817
CPU times: total: 14min 32s
Wall time: 1min 54s
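
Accuracy is easier to interpret next to a majority-class baseline: if one label dominates, a model that always predicts that label can already score highly. The following is a minimal sketch with made-up labels; `majority_baseline_accuracy` is a hypothetical helper.

```python
from collections import Counter
from typing import Sequence


def majority_baseline_accuracy(y_true: Sequence[int]) -> float:
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(y_true)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(y_true)


# Made-up labels for illustration only; not taken from the DMC data.
y_demo = [0, 0, 0, 0, 0, 0, 0, 0, 1, 4]
print(f"baseline accuracy: {majority_baseline_accuracy(y_demo):.3f}")
```

On the notebook's data the same helper could be called as `majority_baseline_accuracy(list(y_test))` and compared with the reported test accuracy.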


In [39]:
# create dataframe from test-prediction with index from X_test
df_y_test_pred = pd.DataFrame(y_test_pred, columns=['week_pred'], index=X_test.index, dtype=np.int8)

# concatenate X_test, y_test, y_pred (put columns next to each other)
df_eval_test = pd.concat([X_test, y_test, df_y_test_pred], axis=1)

In [40]:
df_y_test_pred

Unnamed: 0,week_pred
0,0
1,0
2,0
3,0
4,0
...,...
902003,0
902004,0
902005,0
902006,0


In [41]:
df_eval_test

Unnamed: 0,userID,itemID,order,brand,feature_1,feature_2,feature_3,feature_4,feature_5,0,...,4292,4293,4294,4295,4296,4297,4298,4299,week,week_pred
0,38769,3477,1,186,6,0,196,0,45,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0
1,19039,29622,1,1276,4,0,27,3,66,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0
2,19039,21556,1,1201,4,0,29,0,176,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0
3,17085,14684,1,1194,10,1,503,0,17,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0
4,17085,19969,3,1194,10,0,503,0,85,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
902003,38259,23411,1,1355,4,0,489,3,66,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4,0
902004,10236,6654,1,1496,10,0,359,0,97,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4,0
902005,21521,29277,1,127,10,0,519,0,8,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4,0
902006,43456,11639,1,926,6,0,497,0,13,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4,0


In [42]:
pd.set_option('display.max_rows', 1000)

In [43]:
# show up to 250 rows where the prediction was wrong and the true week != 0
df_eval_test.loc[(df_eval_test['week'] != df_eval_test['week_pred']) & (df_eval_test['week'] != 0)].head(250)

Unnamed: 0,userID,itemID,order,brand,feature_1,feature_2,feature_3,feature_4,feature_5,0,...,4292,4293,4294,4295,4296,4297,4298,4299,week,week_pred
736684,24617,24175,1,347,4,0,22,3,151,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0
736685,24617,8,1,1048,4,0,28,0,175,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0
736686,12590,20146,1,1338,4,0,107,0,44,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0
736687,24617,6029,1,1445,3,0,65535,255,65535,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0
736688,4409,19133,3,827,6,0,491,3,48,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0
736689,24617,11958,1,1201,4,0,291,3,44,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0
736690,24617,22351,1,408,6,0,166,0,122,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0
736691,18599,22553,1,194,10,0,503,0,122,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0
736692,24816,24483,1,194,10,0,503,0,17,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0
736693,24816,6065,1,615,4,0,122,3,16,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0
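
Rather than scanning misclassified rows one by one, the errors can be summarized per class. This sketch uses made-up labels and the column names `week` and `week_pred` from the evaluation frame; it builds a confusion table with `pd.crosstab` and computes the share of each true class that was predicted correctly.

```python
import pandas as pd

# Made-up labels for illustration only.
eval_demo = pd.DataFrame({
    "week":      [0, 0, 0, 1, 1, 2, 4],
    "week_pred": [0, 0, 0, 0, 1, 0, 0],
})

# Rows: true week, columns: predicted week.
confusion = pd.crosstab(eval_demo["week"], eval_demo["week_pred"])
print(confusion)

# Share of each true class that was predicted correctly (per-class recall).
correct = eval_demo["week"] == eval_demo["week_pred"]
recall = correct.groupby(eval_demo["week"]).mean()
print(recall)
```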


In [44]:
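# Select columns for a CSV export (the export itself is currently disabled)
df_eval_export1 = df_eval_test.loc[:,8]      # column labelled 8, i.e. the Multi-Hot column of category 8
df_eval_export2 = df_eval_test.iloc[:,-2]    # second-to-last column: the true 'week' label
type(df_eval_export1)
#df_eval_export = df_eval_export1.join(df_eval_export2, how='outer')
#df_eval_test.to_csv(r'E:\OneDrive\Arbeit\Repos\DMC2022\Kevin\csv\eval.csv', sep='|', encoding='utf-8', index=False)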

pandas.core.series.Series
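
A sketch of how the predictions could be exported with the `|` separator used for the input files. The column layout `userID|itemID|prediction` is an assumption about the submission format, the rows are made up, and an in-memory buffer stands in for a real file path.

```python
import io

import pandas as pd

# Made-up rows for illustration; column names follow the notebook.
eval_rows = pd.DataFrame({
    "userID": [38769, 19039],
    "itemID": [3477, 29622],
    "week": [0, 1],
    "week_pred": [0, 0],
})

# Keep only the identifying columns and the predicted label.
export = eval_rows[["userID", "itemID", "week_pred"]].rename(
    columns={"week_pred": "prediction"}
)

buffer = io.StringIO()  # stand-in for a real file path
export.to_csv(buffer, sep="|", index=False)
print(buffer.getvalue())
```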

In [45]:
y_test_pred = list(y_test_pred)
y_test2 = list(y_test)

In [46]:
for i in range(len(y_test)):
    print(y_test2[i],y_test_pred[i])

0 0
0 0
0 0
0 0
0 0
...
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0
0 0


IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)




4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0
4 0

In [47]:
type(y_test_pred)

list

### Define domain space for range of values 

In [48]:
space = {'max_depth': hp.quniform("max_depth", 3, 5, 1),
        'gamma': hp.uniform ('gamma', 1,9),
        'reg_alpha' : hp.quniform('reg_alpha', 40,180,1),
        'reg_lambda' : hp.uniform('reg_lambda', 0,1),
        'colsample_bytree' : hp.uniform('colsample_bytree', 0.5,1),
        'min_child_weight' : hp.quniform('min_child_weight', 0, 10, 1),
        'n_estimators': 180,
        'seed': 0
    }
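
A side note on the space above: per the Hyperopt documentation, `hp.quniform(label, low, high, q)` returns `round(uniform(low, high) / q) * q`, which is a float even when `q = 1`. The sketch below mimics that formula with the standard library only (the helper `quniform_like` is hypothetical and not part of Hyperopt). It is meant to show why integer-valued parameters such as `max_depth` need an `int()` cast, while fractional ones such as `colsample_bytree` should not be cast.

```python
import random

def quniform_like(rng, low, high, q):
    """Mimic hp.quniform's documented formula: round(uniform(low, high) / q) * q.

    Hyperopt yields a float here; float() reproduces that in plain Python,
    where round() on its own would return an int.
    """
    return float(round(rng.uniform(low, high) / q)) * q

rng = random.Random(0)

# Integer-valued parameter: sampled as a float, so cast before passing it on.
max_depth = quniform_like(rng, 3, 5, 1)

# Fractional parameter drawn from [0.5, 1): int() would truncate it to 0.
colsample_bytree = rng.uniform(0.5, 1)

print(type(max_depth).__name__, int(max_depth))
print(colsample_bytree, int(colsample_bytree))
```

Since XGBoost documents the valid range of `colsample_bytree` as (0, 1], a truncated value of 0 is not usable, so the sampled float should be passed through unchanged.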

### Define objective function

In [49]:
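def objective(space):
    # hp.quniform samples floats, so integer-valued parameters are cast with int().
    # colsample_bytree and reg_lambda are fractions and must stay floats:
    # int() would truncate colsample_bytree to 0, outside XGBoost's (0, 1] range.
    clf = xgb.XGBClassifier(tree_method='gpu_hist', gpu_id=0,
                    n_estimators=space['n_estimators'], max_depth=int(space['max_depth']),
                    gamma=space['gamma'],
                    reg_alpha=int(space['reg_alpha']), reg_lambda=space['reg_lambda'],
                    min_child_weight=int(space['min_child_weight']),
                    colsample_bytree=space['colsample_bytree'])

    evaluation = [(X_train, y_train), (X_test, y_test)]

    clf.fit(X_train, y_train,
            eval_set=evaluation, eval_metric="auc",
            early_stopping_rounds=10, verbose=False)

    # predict() returns class labels 0-4, so compare them to y_test directly;
    # thresholding at 0.5 would collapse weeks 1-4 into a single class.
    pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, pred)
    print("SCORE:", accuracy)
    # fmin minimizes the loss, so return the negated accuracy.
    return {'loss': -accuracy, 'status': STATUS_OK}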
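
Because the target has five classes (0 for no replenishment, 1 to 4 for the week), accuracy should be computed on the class labels themselves. The following stdlib-only sketch uses made-up labels (not data from this notebook) and a hypothetical helper `accuracy` to show how a binary-style `> 0.5` threshold turns weeks 1 to 4 into a single value and distorts the score.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the two label sequences agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up example labels: 0 = no replenishment, 1-4 = week of replenishment.
y_true = [0, 0, 1, 2, 3, 4]
y_pred = [0, 0, 1, 3, 3, 0]

# Multiclass scoring: compare class labels directly.
acc_labels = accuracy(y_true, y_pred)

# Binary-style thresholding maps every non-zero class to True (which equals 1),
# so correct predictions for weeks 2-4 no longer match their true labels.
y_pred_thresholded = [p > 0.5 for p in y_pred]
acc_thresholded = accuracy(y_true, y_pred_thresholded)

print(acc_labels, acc_thresholded)
```

In this toy case the correctly predicted week 3 is counted as a miss after thresholding, which is why the two scores differ.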

### Minimize the objective over the space

In [50]:
dir()

['DecisionTreeClassifier',
 'In',
 'MultiLabelBinarizer',
 'Out',
 'STATUS_OK',
 'Trials',
 'XGBClassifier',
 'X_test',
 'X_train',
 [IPython history and dunder names such as '_', '_47', '__name__', '_i1' through '_i50', '_oh' omitted]
 'accuracy_s

In [51]:
print("The best hyperparameters are : ","\n")
print(best_hyperparams)

The best hyperparameters are :  



NameError: name 'best_hyperparams' is not defined
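
`best_hyperparams` is normally the return value of Hyperopt's `fmin`, so printing it raises a `NameError` when no such call has been executed in the session. As a rough illustration of what minimizing an objective over a space means, the sketch below runs a plain random search using only the standard library. It is a stand-in, not Hyperopt's TPE algorithm, and the toy objective, the `toy_space` bounds, and the helper `random_search` are all assumptions made for the example.

```python
import random

def toy_objective(params):
    """Toy loss with its minimum at gamma = 5 and max_depth = 4 (illustrative only)."""
    return (params["gamma"] - 5.0) ** 2 + (params["max_depth"] - 4) ** 2

def random_search(objective, space, max_evals, seed=0):
    """Sample max_evals points from the space and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(max_evals):
        params = {
            "gamma": rng.uniform(*space["gamma"]),          # continuous range
            "max_depth": rng.randint(*space["max_depth"]),  # inclusive integer range
        }
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

toy_space = {"gamma": (1.0, 9.0), "max_depth": (3, 5)}
best_hyperparams, best_loss = random_search(toy_objective, toy_space, max_evals=100)

print("The best hyperparameters are :", best_hyperparams)
```

With Hyperopt itself, the corresponding step is a call of the form `fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=100, trials=trials)`, which has to run before `best_hyperparams` is printed.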

In [None]:
"""space = { 'eta': hp.quniform('eta', 0.025, 0.5, 0.05),
        'max_depth': hp.quniform("max_depth", 1, 18, 1),
        'gamma': hp.uniform ('gamma', 1,9),
        'reg_alpha' : hp.quniform('reg_alpha', 40,180,1),
        'reg_lambda' : hp.uniform('reg_lambda', 0,1),
        'colsample_bytree' : hp.uniform('colsample_bytree', 0.5,1),
        'min_child_weight' : hp.quniform('min_child_weight', 0, 10, 1),
        'n_estimators': 180,
        'seed': 0
    }
"""

In [None]:
"""
def objective(space):
    # Integer-valued parameters are cast with int(); colsample_bytree and reg_lambda
    # stay floats (int() would truncate colsample_bytree to 0).
    clf = xgb.XGBClassifier(
                    n_estimators=space['n_estimators'], max_depth=int(space['max_depth']),
                    gamma=space['gamma'],
                    reg_alpha=int(space['reg_alpha']), reg_lambda=space['reg_lambda'],
                    min_child_weight=int(space['min_child_weight']),
                    colsample_bytree=space['colsample_bytree'])

    evaluation = [(X_train, y_train), (X_test, y_test)]

    clf.fit(X_train, y_train,
            eval_set=evaluation, eval_metric="auc",
            early_stopping_rounds=10, verbose=False)

    # predict() returns class labels 0-4, so compare them to y_test directly.
    pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, pred)
    print("SCORE:", accuracy)
    return {'loss': -accuracy, 'status': STATUS_OK}
"""

In [None]:
"""
trials = Trials()

best_hyperparams = fmin(fn = objective,
                        space = space,
                        algo = tpe.suggest,
                        max_evals = 100,
                        trials = trials)
"""

In [None]:
!conda list

In [None]:
!pip list