# Next Purchase Date with ML - Top Items

In this use case, we test how a machine learning model performs when it is trained only on the items with the highest number of transactions. The expectation is that these items give the best scores, because they provide more training data than any other item.

## Load and Preprocess Data

We use more than 500k transaction records between users and items from the EPM database. The raw data still contains return transactions with negative amounts, but we are only interested in purchases, so those rows are filtered out. Timestamps are recorded at daily granularity. Because a user can buy the same item several times on the same day, we treat those purchases as a single record and sum the sales quantity column.

In [1]:
import pandas as pd

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

In [2]:
# Load Dataset
df = pd.read_csv('/kaggle/input/epm-prep/EPM.csv')
df = df.drop(['Unnamed: 0', 'principal_code'], axis=1)
df.head()

Unnamed: 0,trx_date,customer_name,ship_to_id,branch_code,item_code,item_desc,principal_desc,gross_sales_amount,sales_qty
0,2021-08-20,JK1-AP. DEVITA_GROUP_NA,EPM_34950,JK1,KCFMB,CEFIXIME 100MG 50 KAPSUL,HEXPHARM (PHARMAMED),195000,3
1,2021-11-02,JK1-AP. DEVITA_GROUP_NA,EPM_34950,JK1,CKCOA,KALCINOL N CREAM 5 GR,KALBE NIMITZ (PHARMAMED),28500,3
2,2021-11-02,JK1-AP. DEVITA_GROUP_NA,EPM_34950,JK1,TMFNB,METFORMIN HCL 200 TABLET,HEXPHARM (PHARMAMED),175000,5
3,2021-11-02,JK1-AP. DEVITA_GROUP_NA,EPM_34950,JK1,TBSVC,BRONSOLVAN 100 TABLET,HEXPHARM TSJ (PHARMAMED),35000,1
4,2021-08-20,JK1-AP. DEVITA_GROUP_NA,EPM_34950,JK1,TALNF,AMLODIPINE BESILATE 10MG,HEXPHARM (PHARMAMED),255000,3


In [3]:
# Keep only purchases: drop returns (non-positive quantity or amount)
df = df[(df['sales_qty'] > 0) & (df['gross_sales_amount'] > 0)]
df['trx_date'] = pd.to_datetime(df['trx_date'])
df.head()

Unnamed: 0,trx_date,customer_name,ship_to_id,branch_code,item_code,item_desc,principal_desc,gross_sales_amount,sales_qty
0,2021-08-20,JK1-AP. DEVITA_GROUP_NA,EPM_34950,JK1,KCFMB,CEFIXIME 100MG 50 KAPSUL,HEXPHARM (PHARMAMED),195000,3
1,2021-11-02,JK1-AP. DEVITA_GROUP_NA,EPM_34950,JK1,CKCOA,KALCINOL N CREAM 5 GR,KALBE NIMITZ (PHARMAMED),28500,3
2,2021-11-02,JK1-AP. DEVITA_GROUP_NA,EPM_34950,JK1,TMFNB,METFORMIN HCL 200 TABLET,HEXPHARM (PHARMAMED),175000,5
3,2021-11-02,JK1-AP. DEVITA_GROUP_NA,EPM_34950,JK1,TBSVC,BRONSOLVAN 100 TABLET,HEXPHARM TSJ (PHARMAMED),35000,1
4,2021-08-20,JK1-AP. DEVITA_GROUP_NA,EPM_34950,JK1,TALNF,AMLODIPINE BESILATE 10MG,HEXPHARM (PHARMAMED),255000,3


In [4]:
# Merge same-day purchases of the same item by the same user, summing sales_qty
temp = df[['ship_to_id', 'item_code', 'trx_date', 'sales_qty']].groupby(['ship_to_id', 'item_code', 'trx_date']).sum().reset_index(drop=True)
df = df.drop_duplicates(['ship_to_id', 'item_code', 'trx_date']).reset_index(drop=True)
df['sales_qty'] = temp
df.head()

Unnamed: 0,trx_date,customer_name,ship_to_id,branch_code,item_code,item_desc,principal_desc,gross_sales_amount,sales_qty
0,2021-08-20,JK1-AP. DEVITA_GROUP_NA,EPM_34950,JK1,KCFMB,CEFIXIME 100MG 50 KAPSUL,HEXPHARM (PHARMAMED),195000,1
1,2021-11-02,JK1-AP. DEVITA_GROUP_NA,EPM_34950,JK1,CKCOA,KALCINOL N CREAM 5 GR,KALBE NIMITZ (PHARMAMED),28500,1
2,2021-11-02,JK1-AP. DEVITA_GROUP_NA,EPM_34950,JK1,TMFNB,METFORMIN HCL 200 TABLET,HEXPHARM (PHARMAMED),175000,1
3,2021-11-02,JK1-AP. DEVITA_GROUP_NA,EPM_34950,JK1,TBSVC,BRONSOLVAN 100 TABLET,HEXPHARM TSJ (PHARMAMED),35000,1
4,2021-08-20,JK1-AP. DEVITA_GROUP_NA,EPM_34950,JK1,TALNF,AMLODIPINE BESILATE 10MG,HEXPHARM (PHARMAMED),255000,2
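
The cell above assigns the grouped sums back by position, which only lines up if the de-duplicated rows happen to be in the same order as the sorted group keys. The first row's quantity dropping from 3 to 1 in the output suggests they may not be. A minimal sketch of an order-independent alternative, using toy data with hypothetical values (`toy`, `keys`, and `daily` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Toy transactions (hypothetical values); rows are deliberately not sorted by key
toy = pd.DataFrame({
    'ship_to_id': ['B', 'A', 'B', 'A'],
    'item_code': ['X', 'X', 'X', 'X'],
    'trx_date': pd.to_datetime(['2021-01-02', '2021-01-01', '2021-01-02', '2021-01-01']),
    'gross_sales_amount': [100, 50, 200, 50],
    'sales_qty': [3, 1, 2, 1],
})

keys = ['ship_to_id', 'item_code', 'trx_date']

# Sum the quantity per key and broadcast the sum back to every row of its group,
# so each row keeps its own index and nothing depends on row order
toy['sales_qty'] = toy.groupby(keys)['sales_qty'].transform('sum')

# Keep the first row of each key; its sales_qty is already the daily total
daily = toy.drop_duplicates(keys).reset_index(drop=True)
print(daily[keys + ['sales_qty']])
```

Because `transform('sum')` returns a result indexed like the input, the totals stay attached to the right user, item, and date regardless of how the rows are ordered.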


## Preprocessing for Use Case

We do not use all the features; we focus only on the sales quantity and timestamp columns.

To select the data, we follow these steps:
1. Sort the unique items in the data by their number of transactions.
2. Choose the top x items with the highest number of transactions (in this case, x=1).
3. For the selected item, sort the users who bought it by their number of transactions for that item only.
4. Choose the top y users with the highest number of transactions (in this case, y=10).
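
The four steps above can be sketched compactly with `value_counts`. This is an illustrative sketch on toy data with hypothetical ids, not the notebook's own cells; the names `toy`, `top_items`, `subset`, and `top_users` are assumptions made for the example.

```python
import pandas as pd

# Toy data (hypothetical ids): one row per (user, item, day) purchase
toy = pd.DataFrame({
    'ship_to_id': ['U1', 'U1', 'U1', 'U2', 'U2', 'U3', 'U1', 'U2'],
    'item_code': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B'],
})

x, y = 1, 2  # how many top items and top users to keep (the notebook uses x=1, y=10)

# Steps 1-2: rank items by transaction count and keep the top x
top_items = toy['item_code'].value_counts().head(x).index.tolist()

# Steps 3-4: within those items, rank users by transaction count and keep the top y
subset = toy[toy['item_code'].isin(top_items)]
top_users = subset['ship_to_id'].value_counts().head(y).index.tolist()

print(top_items, top_users)
```

Counting rows per item is equivalent to counting transactions here only because same-day purchases were already merged into one row per user, item, and day.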

In [5]:
# Create timestamp column (seconds since the Unix epoch); trx_date is already a datetime
df['timestamp'] = df['trx_date'].apply(lambda x: x.timestamp())
df = df[['ship_to_id', 'item_code', 'trx_date', 'sales_qty', 'timestamp']]
df.head()

Unnamed: 0,ship_to_id,item_code,trx_date,sales_qty,timestamp
0,EPM_34950,KCFMB,2021-08-20,1,1629418000.0
1,EPM_34950,CKCOA,2021-11-02,1,1635811000.0
2,EPM_34950,TMFNB,2021-11-02,1,1635811000.0
3,EPM_34950,TBSVC,2021-11-02,1,1635811000.0
4,EPM_34950,TALNF,2021-08-20,2,1629418000.0


In [6]:
# Look for top items with the most transactions 
dfSortItem = df.groupby(['item_code']).count().sort_values('trx_date', ascending=False).reset_index()
dfSortItem[['item_code', 'ship_to_id']]

Unnamed: 0,item_code,ship_to_id
0,TALNE,7118
1,TMFNB,5631
2,TALNF,5593
3,TPRGR,4734
4,TPODA,4733
...,...,...
1284,SRIEM,1
1285,KHLBQ,1
1286,KHLBS,1
1287,TR1D1,1


In [7]:
# Show data with the first top item
itemMax = dfSortItem.loc[0, 'item_code']
dfItemMax = df[df['item_code'] == itemMax].reset_index(drop=True)
dfItemMax

Unnamed: 0,ship_to_id,item_code,trx_date,sales_qty,timestamp
0,EPM_34950,TALNE,2021-08-20,4,1.629418e+09
1,EPM_34950,TALNE,2021-06-14,2,1.623629e+09
2,EPM_34950,TALNE,2021-02-17,3,1.613520e+09
3,EPM_34950,TALNE,2021-01-27,4,1.611706e+09
4,EPM_34950,TALNE,2021-12-06,2,1.638749e+09
...,...,...,...,...,...
7113,EPM_4504186,TALNE,2023-06-13,6,1.686614e+09
7114,EPM_4523526,TALNE,2023-07-25,1,1.690243e+09
7115,EPM_4523526,TALNE,2023-08-30,1,1.693354e+09
7116,EPM_4524809,TALNE,2023-09-12,1,1.694477e+09


In [8]:
# Look for top users with most transactions with the related item
dfItemMaxSortUser = dfItemMax.groupby(['ship_to_id']).count().sort_values('trx_date', ascending=False).reset_index()
dfItemMaxSortUser.head(10)

Unnamed: 0,ship_to_id,item_code,trx_date,sales_qty,timestamp
0,EPM_35159,178,178,178,178
1,EPM_4334085,168,168,168,168
2,EPM_1807311,165,165,165,165
3,EPM_3564728,164,164,164,164
4,EPM_35002,153,153,153,153
5,EPM_34985,137,137,137,137
6,EPM_136080,133,133,133,133
7,EPM_1624002,126,126,126,126
8,EPM_3676050,125,125,125,125
9,EPM_34923,124,124,124,124


In [9]:
# Take the 10 users with the most transactions for the top item
usersMax = dfItemMaxSortUser.head(10)['ship_to_id']
list(usersMax)

['EPM_35159',
 'EPM_4334085',
 'EPM_1807311',
 'EPM_3564728',
 'EPM_35002',
 'EPM_34985',
 'EPM_136080',
 'EPM_1624002',
 'EPM_3676050',
 'EPM_34923']

## Training Model

After choosing the item and its top users, we fit the models separately for each user. We try three different models and compare their performance. For each user, the models are trained on 80% of the data and evaluated on the remaining 20% with the Mean Absolute Error (MAE) and Mean Squared Error (MSE) metrics.
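
To make the target and the two metrics concrete, here is a small sketch on toy data. The dates and predictions are hypothetical values chosen for illustration; the target is the number of days until the user's next purchase, which is how the training cell below derives its `period` column.

```python
import numpy as np
import pandas as pd

# Toy purchase dates for one user and one item (hypothetical values)
dates = pd.Series(pd.to_datetime(['2023-01-01', '2023-01-08', '2023-01-22', '2023-01-29']))

# Days until the next purchase; the last purchase has no successor, so it is NaN
period = dates.diff().dt.days.shift(-1)

y_true = period.dropna().to_numpy()     # observed intervals in days
y_pred = np.array([8.0, 10.0, 7.0])     # hypothetical model predictions

mae = np.mean(np.abs(y_true - y_pred))  # mean absolute error, in days
mse = np.mean((y_true - y_pred) ** 2)   # mean squared error, in days squared

print(y_true, round(mae, 2), round(mse, 2))
```

MAE reads directly as the average miss in days, while MSE weights large misses more heavily, so the two metrics can rank models differently. The results for EPM_3564728 show this: Linear Regression and XGBoost have nearly the same MAE (4.36 and 4.35) but very different MSE (67.18 and 32.82).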

In [10]:
dfTrain = pd.DataFrame()
dfTest = pd.DataFrame()

for i in usersMax:

    # Purchase history of this user for the top item, in chronological order
    dfUser = df[(df['item_code'] == itemMax) & (df['ship_to_id'] == i)].sort_values('trx_date').reset_index(drop=True)
    # Target: days until the next purchase (gap between this row and the following one)
    dfUser['period'] = dfUser['trx_date'].diff().dt.days.shift(-1)
    dfTrain, dfTest = dfUser[:-1], dfUser[-1:]  # The last row has no interval to a next purchase yet

    dfTest = dfTest.drop(['period', 'ship_to_id', 'item_code', 'trx_date'], axis=1)
    X = dfTrain.drop(['period', 'ship_to_id', 'item_code', 'trx_date'], axis=1)
    y = dfTrain['period']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Linear Regression
    modelLR = LinearRegression()
    modelLR.fit(X_train, y_train)
    y_pred = modelLR.predict(X_test)
    maeLR = mean_absolute_error(y_test, y_pred)
    mseLR = mean_squared_error(y_test, y_pred)

    # Random Forest
    modelRF = RandomForestRegressor(random_state=42)
    modelRF.fit(X_train, y_train)
    y_pred = modelRF.predict(X_test)
    maeRF = mean_absolute_error(y_test, y_pred)
    mseRF = mean_squared_error(y_test, y_pred)
    
    # XGBoost
    modelXGB = XGBRegressor()
    modelXGB.fit(X_train, y_train)
    y_pred = modelXGB.predict(X_test)
    maeXGB = mean_absolute_error(y_test, y_pred)
    mseXGB = mean_squared_error(y_test, y_pred)

    print("{:<12} \nLinear Regression MAE: {:<5} MSE: {:<6} Random Forest MAE: {:<5} MSE: {:<6} XGBoost MAE: {:<5} MSE: {:<6}\n".format(i, round(maeLR, 2), round(mseLR, 2), round(maeRF, 2), round(mseRF, 2), round(maeXGB, 2), round(mseXGB, 2)))

EPM_35159    
Linear Regression MAE: 2.84  MSE: 23.12  Random Forest MAE: 3.07  MSE: 23.62  XGBoost MAE: 3.23  MSE: 23.54 

EPM_4334085  
Linear Regression MAE: 1.91  MSE: 7.2    Random Forest MAE: 2.34  MSE: 13.09  XGBoost MAE: 2.31  MSE: 9.71  

EPM_1807311  
Linear Regression MAE: 2.65  MSE: 9.21   Random Forest MAE: 3.7   MSE: 28.12  XGBoost MAE: 3.83  MSE: 28.95 

EPM_3564728  
Linear Regression MAE: 4.36  MSE: 67.18  Random Forest MAE: 4.74  MSE: 51.96  XGBoost MAE: 4.35  MSE: 32.82 

EPM_35002    
Linear Regression MAE: 2.44  MSE: 9.15   Random Forest MAE: 2.94  MSE: 12.29  XGBoost MAE: 3.18  MSE: 17.92 

EPM_34985    
Linear Regression MAE: 5.33  MSE: 45.32  Random Forest MAE: 5.0   MSE: 46.68  XGBoost MAE: 6.1   MSE: 77.85 

EPM_136080   
Linear Regression MAE: 2.36  MSE: 9.62   Random Forest MAE: 3.79  MSE: 39.41  XGBoost MAE: 5.38  MSE: 78.8  

EPM_1624002  
Linear Regression MAE: 4.12  MSE: 23.38  Random Forest MAE: 3.61  MSE: 22.75  XGBoost MAE: 4.33  MSE: 34.78 

EPM_3676