# Next Purchase Date with ML - Top Items

In this use case, we measure how the models perform when trained only on the items with the highest number of transactions. We expect these items to give the best scores, because they provide more training data than any other item.

## Load and Preprocess Data

We use more than 500k transactions between users and items from the EPM database. The raw data still contains return transactions with negative amounts, but we are only interested in purchases, so we filter the returns out. Each transaction is timestamped at daily granularity. Because a user can buy the same item several times on the same day, we treat those rows as a single transaction and sum the sales quantity column.
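
The same-day aggregation described above can be sketched on a toy table. This is only an illustration: the column names follow the notebook, but the values are made up.

```python
import pandas as pd

# Toy transactions; column names follow the notebook, values are invented.
toy = pd.DataFrame({
    "ship_to_id": ["U1", "U1", "U1", "U2"],
    "item_code": ["A", "A", "A", "A"],
    "trx_date": pd.to_datetime(
        ["2021-01-05", "2021-01-05", "2021-01-09", "2021-01-05"]
    ),
    "sales_qty": [2, 3, 1, 4],
})

# One row per (user, item, day); quantities bought on the same day are summed.
keys = ["ship_to_id", "item_code", "trx_date"]
daily = toy.groupby(keys, as_index=False)["sales_qty"].sum()
print(daily)
```

The two purchases by `U1` on 2021-01-05 collapse into one row with a quantity of 5, while the other rows are unchanged.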

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, explained_variance_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder
from tqdm import tqdm
from xgboost import XGBRegressor

In [None]:
# Load Dataset
df = pd.read_csv('/kaggle/input/epm-prep/EPM.csv')
df = df.drop(['Unnamed: 0', 'principal_code'], axis=1)
df.head()

In [None]:
# Keep only purchases: drop returns (non-positive quantity or amount)
df = df[(df['sales_qty'] > 0) & (df['gross_sales_amount'] > 0)].copy()
df['trx_date'] = pd.to_datetime(df['trx_date'])
df.head()

In [None]:
# Merge same-day purchases of one item by one user into a single row.
# transform('sum') keeps the original row order, so the totals stay aligned.
keys = ['ship_to_id', 'item_code', 'trx_date']
df['sales_qty'] = df.groupby(keys)['sales_qty'].transform('sum')
df = df.drop_duplicates(keys).reset_index(drop=True)
df.head()

## Preprocessing for the Use Case

We don't use all the features; we focus only on the sales quantity and timestamp columns.

To select the data, we follow these steps:
1. Sort the unique items by their number of transactions.
2. Choose the top x items with the highest number of transactions (in this case, x=1).
3. For the selected item, sort its unique users by their number of transactions for that item only.
4. Choose the top y users with the highest number of transactions (in this case, y=10).
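
The steps above can be sketched as a small helper. `top_items_and_users` is a hypothetical name introduced here for illustration, and the sketch assumes each row is one transaction (that is, same-day duplicates have already been merged).

```python
import pandas as pd

def top_items_and_users(df, x=1, y=10):
    """Map each of the x busiest items to its y busiest users (hypothetical helper)."""
    # Steps 1-2: rank items by row count and keep the top x.
    top_items = df["item_code"].value_counts().head(x).index.tolist()
    result = {}
    for item in top_items:
        # Steps 3-4: rank users within this item only and keep the top y.
        users = df.loc[df["item_code"] == item, "ship_to_id"].value_counts().head(y)
        result[item] = users.index.tolist()
    return result

# Toy data: item A has 4 transactions (3 by U1, 1 by U2), item B has 2.
toy = pd.DataFrame({
    "item_code": ["A", "A", "A", "A", "B", "B"],
    "ship_to_id": ["U1", "U1", "U1", "U2", "U1", "U3"],
})
selected = top_items_and_users(toy, x=1, y=1)
print(selected)
```

Note that the order among items or users with equal counts is not something this sketch controls, so ties may be broken arbitrarily.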

In [None]:
# Create a numeric timestamp column (seconds since the Unix epoch);
# trx_date was already converted to datetime above
df['timestamp'] = df['trx_date'].apply(lambda x: x.timestamp())
df = df[['ship_to_id', 'item_code', 'trx_date', 'sales_qty', 'timestamp']]
df.head()

In [None]:
# Look for top items with the most transactions 
dfSortItem = df.groupby(['item_code']).count().sort_values('trx_date', ascending=False).reset_index()
dfSortItem[['item_code', 'ship_to_id']]

In [None]:
# Show data with the first top item
itemMax = dfSortItem.loc[0, 'item_code']
dfItemMax = df[df['item_code'] == itemMax].reset_index(drop=True)
dfItemMax

In [None]:
# Look for top users with most transactions with the related item
dfItemMaxSortUser = dfItemMax.groupby(['ship_to_id']).count().sort_values('trx_date', ascending=False).reset_index()
dfItemMaxSortUser.head(10)

In [None]:
# Keep the 10 users with the most transactions for the selected item
usersMax = dfItemMaxSortUser['ship_to_id'].head(10)
list(usersMax)

## Training Model

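After choosing the item and its top users, we fit the models separately for each user. The target is the number of days until the user's next purchase of the item. We try three different models (Linear Regression, Random Forest, and XGBoost) and compare their performance. Each model is trained on 80% of the user's data and evaluated on the remaining 20% using the Mean Absolute Error (MAE) and Mean Squared Error (MSE) metrics.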
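
As a small worked example, the sketch below builds a days-until-next-purchase target from four made-up purchase dates and computes MAE and MSE by hand with NumPy. These are the same quantities that `mean_absolute_error` and `mean_squared_error` from scikit-learn return; the dates and predictions are invented for illustration.

```python
import numpy as np
import pandas as pd

# Four purchase dates of one user for one item (invented values).
dates = pd.Series(
    pd.to_datetime(["2021-01-01", "2021-01-08", "2021-01-10", "2021-01-20"])
)

# Days between consecutive purchases, shifted up so that each row holds
# the gap to the NEXT purchase; the last row has no next purchase (NaN).
period = dates.diff().dt.days.shift(-1)

# Compare hypothetical predictions against the three known gaps.
y_true = np.array([7, 2, 10])
y_pred = np.array([5, 4, 10])
mae = np.mean(np.abs(y_true - y_pred))  # (2 + 2 + 0) / 3
mse = np.mean((y_true - y_pred) ** 2)   # (4 + 4 + 0) / 3
print(period.tolist(), mae, mse)
```

MSE penalizes large misses more heavily than MAE, which is why it is useful to report both.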

In [None]:
for i in usersMax:

    # Transactions of this user for the selected item, in chronological order
    dfUser = df[(df['item_code'] == itemMax) & (df['ship_to_id'] == i)].sort_values('trx_date').reset_index(drop=True)
    # Target: days until the next purchase (NaN on the last row)
    dfUser['period'] = dfUser['trx_date'].diff().dt.days.shift(-1)
    dfTrain, dfTest = dfUser[:-1], dfUser[-1:]  # The last row has no interval to a next purchase

    dfTest = dfTest.drop(['period', 'ship_to_id', 'item_code', 'trx_date'], axis=1)
    X = dfTrain.drop(['period', 'ship_to_id', 'item_code', 'trx_date'], axis=1)
    y = dfTrain['period']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Linear Regression
    modelLR = LinearRegression()
    modelLR.fit(X_train, y_train)
    y_pred = modelLR.predict(X_test)
    maeLR = mean_absolute_error(y_test, y_pred)
    mseLR = mean_squared_error(y_test, y_pred)

    # Random Forest
    modelRF = RandomForestRegressor(random_state=42)
    modelRF.fit(X_train, y_train)
    y_pred = modelRF.predict(X_test)
    maeRF = mean_absolute_error(y_test, y_pred)
    mseRF = mean_squared_error(y_test, y_pred)
    
    # XGBoost
    modelXGB = XGBRegressor()
    modelXGB.fit(X_train, y_train)
    y_pred = modelXGB.predict(X_test)
    maeXGB = mean_absolute_error(y_test, y_pred)
    mseXGB = mean_squared_error(y_test, y_pred)

    print("{:<12} \nLinear Regression MAE: {:<5} MSE: {:<6} Random Forest MAE: {:<5} MSE: {:<6} XGBoost MAE: {:<5} MSE: {:<6}\n".format(i, round(maeLR, 2), round(mseLR, 2), round(maeRF, 2), round(mseRF, 2), round(maeXGB, 2), round(mseXGB, 2)))
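
The loop above sets aside each user's last transaction in `dfTest` but does not show how a predicted interval becomes a date. One possible way to do that, as a sketch only: `next_purchase_date` is a hypothetical helper, and rounding to whole days and clamping negative predictions to zero are assumptions, not part of the notebook.

```python
import pandas as pd

def next_purchase_date(last_date, predicted_days):
    """Add a predicted interval in days to the last purchase date (hypothetical helper)."""
    # Assumption: round to whole days and never go earlier than the last purchase.
    days = max(0, round(float(predicted_days)))
    return pd.Timestamp(last_date) + pd.Timedelta(days=days)

# Example with an invented last purchase date and predicted interval.
predicted = next_purchase_date("2021-03-01", 6.6)
print(predicted)
```

In the notebook's terms, `predicted_days` would come from calling a fitted model's `predict` on the `dfTest` row, and `last_date` would be that user's most recent `trx_date`.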