# DMC 2022
### Predicting user-based replenishment of a product based on historical orders and item features 

## 1. Task

The participating teams’ goal is to predict the user-based replenishment of a product based on
historical orders and item features. Individual items and user specific orders are given for the period
between 01.06.2020 and 31.01.2021. The prediction period is between 01.02.2021 and 28.02.2021,
which is exactly four weeks long.
For a predefined subset of user and product combinations, the participants shall predict if and when
a product will be purchased during the prediction period.
The prediction column in the “submission.csv” file must be filled accordingly.
* 0 - no replenishment during that period
* 1 - replenishment in the first week
* 2 - replenishment in the second week
* 3 - replenishment in the third week
* 4 - replenishment in the fourth week
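
The label scheme above can be sketched as a small helper. This is only an illustration: the function name `week_label` is hypothetical, and it assumes the four weeks are consecutive 7-day blocks starting on 01.02.2021, which the task description implies but does not spell out.

```python
from datetime import date
from typing import Optional

PERIOD_START = date(2021, 2, 1)   # first day of the prediction period
PERIOD_END = date(2021, 2, 28)    # last day of the prediction period


def week_label(purchase: Optional[date]) -> int:
    """Return 0 for no purchase in the period, otherwise the week number 1-4.

    Assumption: week 1 covers days 1-7 of the period, week 2 days 8-14, and so on.
    """
    if purchase is None or not (PERIOD_START <= purchase <= PERIOD_END):
        return 0
    # Day offsets 0-6 fall in week 1, offsets 7-13 in week 2, etc.
    return (purchase - PERIOD_START).days // 7 + 1


for d in (date(2021, 2, 1), date(2021, 2, 8), date(2021, 2, 28), None):
    print(d, "->", week_label(d))
```

A purchase outside the prediction period is treated the same as no purchase at all, so it maps to class 0.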

## 2. Problem Definition

The problem we will be exploring is **multiclass classification**. Based on a number of different features, we try to predict whether a certain customer will replenish a product in one of the weeks 1-4 (classes 1 to 4) or not at all (class 0).

## 3. Tools we are going to use

* [pandas](https://pandas.pydata.org/) for data analysis and data manipulation
* [KNIME](https://www.knime.com/) for data analysis (outside of this notebook)
* [NumPy](https://numpy.org/) for numerical operations
* [Matplotlib](https://matplotlib.org/) for visualization
* [Scikit-Learn](https://scikit-learn.org/stable/) for machine learning modeling and evaluation
* [XGBoost](https://xgboost.readthedocs.io/en/stable/) for gradient boosting

## 4. Features

1. date
2. userID
3. itemID
4. order
5. brand
6. feature_1
7. feature_2
8. feature_3
9. feature_4
10. feature_5
11. categories
12. week

#### Not used
13. RCP
14. parent_category

## Imports and Functions

In [1]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import scipy as sc
import gc

import xgboost as xgb
from xgboost import XGBClassifier

from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.tree import DecisionTreeClassifier

def show_mem_usage(df):
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

## Read data

In [2]:
#file1 = r'E:\OneDrive\Arbeit\Repos\DMC2022\Kevin\csv\06_complete_dataset_labeled_week0.csv'

file1 = r'E:\OneDrive\Arbeit\Repos\DMC2022\Kevin\csv\07_complete_dataset_labeled_noOnetimers_RCP.csv'
df_data = pd.read_csv(file1, sep='|', dtype={'userID':np.uint32,
                                            'date':str, 
                                            'itemID':np.uint32,
                                            'order':np.uint8,
                                            'brand':np.uint16,
                                            'feature_1':np.uint8,
                                            'feature_2':np.uint8,
                                            'feature_3':np.uint16,
                                            'feature_4':np.uint8,
                                            'feature_5':np.uint16,
                                            'week':np.uint8})
                     #chunksize=10000)

show_mem_usage(df_data)
df_data.head(10)

Memory usage of dataframe is 42.56 MB


| Unnamed: 0 | date | userID | itemID | order | brand | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | categories | week | RCP | Mean(date&time diff) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-06-01 | 29737 | 5237 | 1 | 1201 | 10 | 0 | 53 | 3 | 87 | [327, 3129, 414, 4206] | 1 | 0.26699 | 41.227027 |
| 1 | 2020-06-01 | 29737 | 11535 | 3 | 328 | 4 | 0 | 498 | 3 | 13 | [715, 3267] | 1 | 0.158333 | 36.329897 |
| 2 | 2020-06-01 | 13081 | 16536 | 1 | 615 | 10 | 0 | 6 | 0 | 84 | [390, 2080, 536, 1708] | 1 | 0.111111 | 50.378378 |
| 3 | 2020-06-01 | 19712 | 15299 | 2 | 1023 | 10 | 0 | 503 | 0 | 17 | [3672, 1091, 1085, 1578, 2325] | 1 | 0.2625 | 45.744681 |
| 4 | 2020-06-01 | 19712 | 26623 | 3 | 38 | 10 | 0 | 528 | 0 | 132 | [2019, 1633, 482] | 1 | 0.02439 | 27.0 |
| 5 | 2020-06-01 | 34083 | 18169 | 1 | 73 | 10 | 0 | 421 | 3 | 3 | [2116, 3224, 3156, 2690] | 1 | 0.233503 | 44.677966 |
| 6 | 2020-06-01 | 23038 | 31567 | 1 | 408 | 4 | 0 | 334 | 0 | 44 | [1711, 2621, 2919] | 1 | 0.188073 | 32.577465 |
| 7 | 2020-06-01 | 40277 | 30133 | 1 | 408 | 4 | 0 | 334 | 0 | 44 | [1711, 2621, 2919, 3924, 3915, 3914] | 1 | 0.266026 | 39.968468 |
| 8 | 2020-06-01 | 29971 | 8793 | 1 | 990 | 4 | 0 | 474 | 0 | 65535 | [3867, 1998, 3025, 46, 3649, 3915, 3914] | 1 | 0.142857 | 48.785714 |
| 9 | 2020-06-01 | 22109 | 8004 | 1 | 6 | 6 | 0 | 303 | 3 | 45 | [2977, 1772, 1118, 4025, 4026] | 1 | 0.090032 | 43.764706 |


In [None]:
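# Drop helper columns if they exist; they do not appear in the preview above,
# so errors='ignore' avoids a KeyError when they are missing
df_data = df_data.drop(columns=['lastPurchaseDate', 'purchaseDates'], errors='ignore')
df_data.head(10)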

# Preprocessing

In [None]:
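# Sort chronologically and reset the index so that later index-based joins
# and positional lookups refer to the same rows
df_data = df_data.sort_values('date').reset_index(drop=True)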
df_data.head(10)

### Multi-Hot-Encoding for categories

In contrast to One-Hot-Encoding, where each cell holds a single value that becomes a one in the matching column, Multi-Hot-Encoding turns the multiple entries of one cell into ones across several columns. We therefore first have to parse the string in our `categories` column into a list, so that every category maps to exactly one column and no duplicates arise.
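
As a minimal sketch of the idea, the following toy example encodes three rows of category lists with `MultiLabelBinarizer`. The variable names `toy`, `mlb_toy` and `encoded` are illustrative and the values are made up for the example, not a sample of the dataset.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy category lists, one list per row (illustrative values only)
toy = pd.Series([[327, 414], [715], [327, 715, 3267]])

mlb_toy = MultiLabelBinarizer()
encoded = pd.DataFrame(mlb_toy.fit_transform(toy), columns=mlb_toy.classes_)

# One column per distinct category; a row has a 1 in every column whose
# category appears in its list, and a 0 everywhere else
print(encoded)
```

Each distinct category becomes exactly one column, no matter how many rows mention it, which is why the column count grows with the number of distinct categories rather than with the length of the lists.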

#### Memory problem after Multi-Hot-Encoding
The problem we face when Multi-Hot-Encoding our categories is the following: after preprocessing and encoding we have 3,040,458,033 data points (904,091 rows × 3,363 columns). When we encode our categories with the `str.get_dummies()` method, the resulting dataframe takes up roughly 30 GB, depending on how many rows and features we use. With a dataframe this big we run into memory problems when processing our data and building our model.
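
A back-of-the-envelope check of these numbers, assuming a dense layout with a fixed number of bytes per cell and ignoring the index and any pandas overhead. The helper `dense_size_gib` is hypothetical and only exists for this estimate.

```python
def dense_size_gib(rows: int, cols: int, bytes_per_cell: int) -> float:
    """Size of a dense rows x cols table in GiB, ignoring index and pandas overhead."""
    return rows * cols * bytes_per_cell / 1024**3


ROWS, COLS = 904_091, 3_363   # shape quoted in the text

print(f"cells: {ROWS * COLS:,}")
# 1 byte per cell corresponds to int8, 8 bytes to 64-bit integers or floats
for name, width in [("int8", 1), ("int64", 8)]:
    print(f"{name:>6}: {dense_size_gib(ROWS, COLS, width):.1f} GiB")
```

With 8 bytes per cell the estimate lands above 20 GiB, the same order of magnitude as the figure reported here; the exact size depends on the dtypes and on the remaining columns.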

#### Solution
There are a couple of different ways to work around this problem. Normally we could work around memory limitations using batch processing or external memory. In the case of the DMC dataset this is not optimal, since we need the whole customer history to make accurate predictions.

Since most of the columns we create through Multi-Hot-Encoding will be filled with zeros, we use a sparse matrix to significantly reduce the size of the resulting dataframe. With this approach the dataframe shrinks to 113 MB instead of ~30 GB.
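
One way to avoid ever materializing the dense array is to let `MultiLabelBinarizer` return a SciPy sparse matrix and wrap it with `pd.DataFrame.sparse.from_spmatrix`. The sketch below is an alternative illustration on made-up toy data, not the exact pipeline used in this notebook; the names `mlb_sparse`, `csr` and `sparse_df` are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy category lists (illustrative values only)
toy = pd.Series([[327, 414], [715], [327, 715, 3267]])

mlb_sparse = MultiLabelBinarizer(sparse_output=True)
csr = mlb_sparse.fit_transform(toy)   # SciPy CSR matrix, zeros are not stored

# Wrap the sparse matrix in a DataFrame with sparse columns;
# string column names keep the frame compatible with scikit-learn estimators
sparse_df = pd.DataFrame.sparse.from_spmatrix(
    csr,
    index=toy.index,
    columns=[str(c) for c in mlb_sparse.classes_],
)

print(sparse_df.dtypes)
print("share of stored (non-zero) cells:", sparse_df.sparse.density)
```

On a three-row toy frame the saving is negligible; the benefit only shows on large tables where almost every cell is zero, as is the case for the category columns here.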

In [None]:
# Convert strings to lists of integers in 'categories'
df_cat = df_data

df_cat["categories"] = df_cat["categories"].apply(lambda x: [int(i) for i in x[1:-1].split(',')])
df_cat["categories"]

In [None]:
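# Multi-Hot-Encode the category lists (dense output here; the conversion to a sparse dtype follows in the next cell)
c = df_cat["categories"]
mlb = MultiLabelBinarizer(sparse_output=False) # Set to True to get a SciPy CSR sparse matrix instead of a dense array
# Reuse the index of df_cat so the later join aligns rows correctly,
# and use string column names so all feature names share one type
df_multi_hot = pd.DataFrame(mlb.fit_transform(c), columns=[str(k) for k in mlb.classes_], index=df_cat.index, dtype=np.int8)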

show_mem_usage(df_multi_hot)

In [None]:
# Convert dataframe to sparse type
sparse_df_mh = df_multi_hot.astype(pd.SparseDtype("float64",0))
print(sparse_df_mh.info())
sparse_df_mh

In [None]:
del df_multi_hot
gc.collect()

In [None]:
%%time

# Combine df_data and sparse_df_mh
df_combined = df_cat.join(sparse_df_mh, how='inner')
show_mem_usage(df_combined)
df_combined.head()

In [None]:
# pop and append 'week' at end of dataframe
col = df_combined.pop("week")
df_combined.insert(len(df_combined.columns), col.name, col)
df_combined.head()

In [None]:
# Check if we have any missing values
df_combined[df_combined.isnull().any(axis=1)]

In [None]:
df_combined.drop('categories', axis=1, inplace=True)
show_mem_usage(df_combined)

# Model

### Splitting Training- / Testdata

In [None]:
df = df_combined.copy()
id(df), id(df_combined)

In [None]:
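# sort_values returns a new dataframe, so the result has to be assigned back
df = df.sort_values('date').reset_index(drop=True)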

In [None]:
df.head()

In [None]:
# Get the position of the first January 2021 date for the split
idx = df.date.searchsorted('2021-01-01', side='left') # the column must already be sorted for searchsorted
idx
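
The lookup works because the dates are ISO-formatted strings (`YYYY-MM-DD`), which sort lexicographically in chronological order. A minimal sketch on made-up toy data, with illustrative names `toy`, `split`, `train` and `test`:

```python
import pandas as pd

# Toy data with ISO date strings (illustrative values only)
toy = pd.DataFrame({
    "date": ["2021-01-01", "2020-12-30", "2021-01-02", "2020-12-31", "2021-01-01"],
    "week": [0, 0, 0, 1, 2],
})

# searchsorted requires sorted input; resetting the index keeps labels and positions in sync
toy = toy.sort_values("date", kind="stable").reset_index(drop=True)

# Position of the first row whose date is >= '2021-01-01'
split = toy["date"].searchsorted("2021-01-01", side="left")

# Slicing with the same position on both sides assigns every row to exactly one part
train, test = toy.iloc[:split], toy.iloc[split:]
print(len(train), "training rows,", len(test), "test rows")
```

Because `iloc[:split]` already excludes the row at position `split`, no further offset is needed to keep the two parts disjoint.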

In [None]:
# check index: first test date and last training date (positional lookup)
df['date'].iloc[idx], df['date'].iloc[idx - 1]

In [None]:
# drop date
df.drop('date', axis=1, inplace=True)

In [None]:
# Comma is being used to extract a specific column from a 2D array.
# X = data.iloc[:,:-1]
# X = all rows, all columns except the last one 

X = df.iloc[:,0:-1]
X

In [None]:
y = df.iloc[:,-1]
y

In [None]:
# Split training/test data
# train = jun-dec20 / test = jan21

# iloc[:idx] already excludes row idx, so no row is lost between train and test
X_train = X.iloc[:idx]
X_test = X.iloc[idx:]
y_train = y.iloc[:idx]
y_test = y.iloc[idx:]

In [None]:
#X_test.sample(frac=1)

In [None]:
show_mem_usage(X_train), show_mem_usage(X_test)
X_train

In [None]:
y_train

In [None]:
# Alternative: random split of training and test data
# stratify=y preserves the class proportions of the target in both the train and the test set.
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, stratify=y)

#show_mem_usage(X_train), show_mem_usage(X_test)

# DecisionTreeClassifier

In [None]:
X_test

In [None]:
y_test

In [None]:
%%time

classifier = DecisionTreeClassifier()
classifier = classifier.fit(X_train,y_train)

In [None]:
classifier.score(X_train,y_train), classifier.score(X_test,y_test)

In [None]:
y_train_pred = classifier.predict(X_train)
y_test_pred = classifier.predict(X_test)

dct_train = accuracy_score(y_train, y_train_pred)
dct_test = accuracy_score(y_test, y_test_pred)
print()
print(f'Decision Tree train/test accuracies: '
     f'{dct_train:.3f}/{dct_test:.3f}')
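
Accuracy alone does not show which classes are predicted well. A possible complement, sketched on made-up labels rather than on the notebook's actual predictions, is a confusion matrix with per-class recall. The arrays `y_true_toy` and `y_pred_toy` are invented for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration only (0 = no replenishment, 1-4 = week)
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 1, 2, 3, 4])
y_pred_toy = np.array([0, 0, 0, 0, 0, 0, 0, 2, 0, 4])

labels = [0, 1, 2, 3, 4]
acc = accuracy_score(y_true_toy, y_pred_toy)
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=labels)

# Per-class recall: correctly predicted rows divided by the actual rows of that class
recall = cm.diagonal() / cm.sum(axis=1)

print(f"accuracy: {acc:.2f}")
for label, value in zip(labels, recall):
    print(f"class {label}: recall {value:.2f}")
```

In this toy case the overall accuracy looks high although two of the four week classes are never predicted correctly, which is the kind of effect a single accuracy number can hide.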

In [None]:
y_test_pred = list(y_test_pred)
y_test2 = list(y_test)

In [None]:
# Print actual vs. predicted label for every test row
for i in range(len(y_test2)):
    print(y_test2[i], y_test_pred[i])

# XGBClassifier

In [None]:
%%time

model1 = XGBClassifier()

gbm = model1.fit(X_train, y_train)

y_train_pred = gbm.predict(X_train)
y_test_pred = gbm.predict(X_test)

xgb_train = accuracy_score(y_train, y_train_pred)
xgb_test = accuracy_score(y_test, y_test_pred)
print()
print(f'XGBoost train/test accuracies: '
     f'{xgb_train:.3f}/{xgb_test:.3f}')
