### Preparation of dataset ###

In [37]:
import pandas as pd
import numpy
import matplotlib.pyplot as plt

In [38]:
db = pd.read_csv('test_task_data.csv')

In [39]:
db.head()

Unnamed: 0,Transaction_id,customer_id,Date,Product,Gender,Device_Type,Country,State,State.1,City,Category,Customer_Login_type,Delivery_Type,Transaction_Result,Amount US$,Individual_Price_US$,Year_Month,Time,Quantity
0,40170,1348959766,14/11/2013,Hair Band,Female,Web,United States,New York,New York,New York City,Accessories,Member,one-day deliver,0,6910,576,13-Nov,22:35:51,12
1,33374,2213674919,05/11/2013,Hair Band,Female,Web,United States,California,California,Los Angles,Accessories,Member,one-day deliver,1,1699,100,13-Nov,06:44:41,17
2,14407,1809450308,01/10/2013,Hair Band,Female,Web,United States,Washington,Washington,Seattle,Accessories,Member,Normal Delivery,0,4998,217,13-Oct,00:41:24,23
3,15472,1691227134,04/10/2013,Hair Band,Female,Web,United States,Washington,Washington,Seattle,Accessories,Member,Normal Delivery,0,736,32,13-Oct,22:04:03,23
4,18709,2290737237,12/10/2013,Hair Band,Female,Web,United States,Washington,Washington,Seattle,Accessories,Member,Normal Delivery,1,4389,191,13-Oct,15:00:46,23


There is a duplicated column: 'State.1' repeats 'State', so we can drop it.

In [40]:
data = db.drop(columns = 'State.1')

Converting the 'Amount US$' and 'Individual_Price_US$' columns to float:

In [41]:
data['Amount US$']= data['Amount US$'].str.replace(',','').astype(float)

In [42]:
# Rows containing the invalid entry '#VALUE!' are removed.
# .copy() avoids a SettingWithCopyWarning on the assignments that follow.
data = data[data['Individual_Price_US$'] != '#VALUE!'].copy()

In [43]:
data['Individual_Price_US$']= data['Individual_Price_US$'].str.replace(',','').astype(float)
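
An alternative to filtering on the literal '#VALUE!' string is to coerce every non-numeric entry to NaN and drop those rows, which would also catch other malformed values. A minimal sketch on made-up values (the `sample` frame is hypothetical and not taken from the dataset):

```python
import pandas as pd

# Hypothetical sample mimicking the raw string column
sample = pd.DataFrame({"Individual_Price_US$": ["576", "1,200", "#VALUE!", "32"]})

# Strip thousands separators, then turn anything unparseable into NaN
sample["Individual_Price_US$"] = pd.to_numeric(
    sample["Individual_Price_US$"].str.replace(",", "", regex=False),
    errors="coerce",
)

# Keep only the rows whose price parsed successfully
sample = sample.dropna(subset=["Individual_Price_US$"])
print(sample)
```

Whether silently dropping every unparseable row is acceptable depends on the data; counting the NaN values before dropping them is a cheap sanity check.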

A new 'Datetime' column holds 'Date' converted to datetime format. Because this is time-series data, we sort by that column to make sure the rows are in chronological order.

In [44]:
data['Datetime'] = pd.to_datetime(data['Date'], format='%d/%m/%Y')
data_cleaned = data.sort_values(by=['Datetime'])

In [45]:
data_cleaned

Unnamed: 0,Transaction_id,customer_id,Date,Product,Gender,Device_Type,Country,State,City,Category,Customer_Login_type,Delivery_Type,Transaction_Result,Amount US$,Individual_Price_US$,Year_Month,Time,Quantity,Datetime
36222,10181,1272660297,20/09/2013,spectacles,Male,Web,United States,Washington,Seattle,Fashion,Member,one-day deliver,1,2575.0,136.0,13-Sep,15:52:28,19,2013-09-20
9760,10105,1714931982,20/09/2013,Shirt,Male,Web,United States,Washington,Seattle,Clothing,Member,one-day deliver,1,6619.0,509.0,13-Sep,23:02:25,13,2013-09-20
9759,10101,1939632524,20/09/2013,Shirt,Male,Web,United States,Washington,Seattle,Clothing,Member,Normal Delivery,0,4550.0,228.0,13-Sep,04:36:48,20,2013-09-20
9758,10096,1325671900,20/09/2013,Shirt,Male,Web,United States,Washington,Seattle,Clothing,Member,one-day deliver,1,1000.0,83.0,13-Sep,20:32:05,12,2013-09-20
9757,10094,1883062528,20/09/2013,Shirt,Male,Web,United States,Washington,Seattle,Clothing,Member,one-day deliver,1,420.0,35.0,13-Sep,16:04:17,12,2013-09-20
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24947,92748,1578329112,13/01/2014,watNew York Citys,Male,Mobile,United States,California,Los Angles,Electronics,Member,one-day deliver,1,3000.0,188.0,14-Jan,23:31:09,16,2014-01-13
24948,92818,1477771542,13/01/2014,watNew York Citys,Male,Mobile,United States,California,Los Angles,Electronics,Member,one-day deliver,1,5194.0,260.0,14-Jan,16:51:46,20,2014-01-13
24949,92909,1416420740,13/01/2014,watNew York Citys,Male,Mobile,United States,California,Los Angles,Electronics,Member,one-day deliver,1,1130.0,59.0,14-Jan,19:41:17,19,2014-01-13
62877,93294,2113032506,13/01/2014,Shoes,Female,Mobile,United States,California,Los Angles,wearables,Member,one-day deliver,1,0.0,0.0,14-Jan,02:24:56,12,2014-01-13


In [46]:
print(f"data collection period: {data_cleaned['Datetime'].iloc[-1] - data_cleaned['Datetime'].iloc[0]}")

data collection period: 115 days 00:00:00


We need to check when each customer's next transaction happened in order to determine whether it came within 2 months. To achieve that, we create a new column holding the date of the customer's next transaction by shifting 'Datetime' within each customer group using the groupby method.

In [47]:
data_cleaned['next transaction']=data_cleaned.groupby('customer_id')['Datetime'].shift(-1)


The new column 'days_difference' is added, which holds the number of days between a customer's transaction and their next one.

In [48]:
data_cleaned['days_difference'] = (data_cleaned['next transaction']-data_cleaned['Datetime']).dt.days

The new column 'output' is added which shows whether each customer's next transaction happened within 2 months (1) or not (0).

In [49]:
data_cleaned['output']=data_cleaned['days_difference'].apply(lambda x: 1 if x <=60  else 0)
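
The three steps above (shift within each customer, take the difference in days, threshold at 60) can be traced on a small example. This is a sketch on hypothetical toy data; the vectorized comparison used here is assumed to be equivalent to the `apply` with a lambda in the cell above.

```python
import pandas as pd

# Hypothetical toy transactions: customer 1 buys three times, customer 2 once
toy = pd.DataFrame({
    "customer_id": [1, 2, 1, 1],
    "Datetime": pd.to_datetime(
        ["2013-09-20", "2013-09-25", "2013-10-05", "2014-01-10"]
    ),
}).sort_values("Datetime")

# Date of the same customer's next transaction (NaT for their last one)
toy["next transaction"] = toy.groupby("customer_id")["Datetime"].shift(-1)
toy["days_difference"] = (toy["next transaction"] - toy["Datetime"]).dt.days

# NaN <= 60 evaluates to False, so rows without a next transaction get 0
toy["output"] = (toy["days_difference"] <= 60).astype(int)
print(toy)
```

Only the first row is labelled 1 here: customer 1 returns after 15 days, then after 97 days, and the last transaction of each customer has no successor.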

In [50]:
data_cleaned

Unnamed: 0,Transaction_id,customer_id,Date,Product,Gender,Device_Type,Country,State,City,Category,...,Transaction_Result,Amount US$,Individual_Price_US$,Year_Month,Time,Quantity,Datetime,next transaction,days_difference,output
36222,10181,1272660297,20/09/2013,spectacles,Male,Web,United States,Washington,Seattle,Fashion,...,1,2575.0,136.0,13-Sep,15:52:28,19,2013-09-20,NaT,,0
9760,10105,1714931982,20/09/2013,Shirt,Male,Web,United States,Washington,Seattle,Clothing,...,1,6619.0,509.0,13-Sep,23:02:25,13,2013-09-20,NaT,,0
9759,10101,1939632524,20/09/2013,Shirt,Male,Web,United States,Washington,Seattle,Clothing,...,0,4550.0,228.0,13-Sep,04:36:48,20,2013-09-20,NaT,,0
9758,10096,1325671900,20/09/2013,Shirt,Male,Web,United States,Washington,Seattle,Clothing,...,1,1000.0,83.0,13-Sep,20:32:05,12,2013-09-20,NaT,,0
9757,10094,1883062528,20/09/2013,Shirt,Male,Web,United States,Washington,Seattle,Clothing,...,1,420.0,35.0,13-Sep,16:04:17,12,2013-09-20,NaT,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24947,92748,1578329112,13/01/2014,watNew York Citys,Male,Mobile,United States,California,Los Angles,Electronics,...,1,3000.0,188.0,14-Jan,23:31:09,16,2014-01-13,NaT,,0
24948,92818,1477771542,13/01/2014,watNew York Citys,Male,Mobile,United States,California,Los Angles,Electronics,...,1,5194.0,260.0,14-Jan,16:51:46,20,2014-01-13,NaT,,0
24949,92909,1416420740,13/01/2014,watNew York Citys,Male,Mobile,United States,California,Los Angles,Electronics,...,1,1130.0,59.0,14-Jan,19:41:17,19,2014-01-13,NaT,,0
62877,93294,2113032506,13/01/2014,Shoes,Female,Mobile,United States,California,Los Angles,wearables,...,1,0.0,0.0,14-Jan,02:24:56,12,2014-01-13,NaT,,0


In this dataset, an 'output' value of 1 means the next transaction happened within 60 days; 0 means it did not (including the case where there is no next transaction).

Let's check how many samples exist in each class

In [51]:
print(f"Number of samples: {len(data_cleaned)}")
print(f"Number of transactions followed by another within 2 months: {data_cleaned['output'].sum()}")

Number of samples: 65376
Number of transactions followed by another within 2 months: 506


There is a large imbalance between the two classes. Of 65376 samples, only 506 belong to customers whose next purchase happened within 2 months. Training an ML model on such a highly imbalanced dataset can cause several problems:

*   The model will likely become biased toward the majority class (customers who have not made a next transaction within 2 months). As a result, it might predict the majority class most of the time and ignore the minority class.
*   This can lead to high accuracy but poor performance on the minority class, which means that common metrics can be misleading.
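
To see why accuracy alone can mislead here, consider a trivial classifier that always predicts class 0. The sketch below uses only the two counts printed above; the variable names are illustrative.

```python
# Counts reported above
n_samples = 65376
n_positive = 506  # next transaction within 2 months

# A classifier that always predicts class 0 is right on every negative sample
baseline_accuracy = (n_samples - n_positive) / n_samples

# ...but it never finds a single positive sample
baseline_recall_positive = 0 / n_positive

print(f"Always-0 baseline accuracy: {baseline_accuracy:.4f}")
print(f"Always-0 baseline recall on class 1: {baseline_recall_positive:.2f}")
```

Any model evaluated on this data should therefore be compared against an accuracy of roughly 99%, and judged mainly on minority-class precision and recall.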



A new dataset with the following columns is selected to train the model: 'Product', 'Gender',
       'Device_Type', 'Country', 'State', 'City', 'Category',
       'Customer_Login_type', 'Delivery_Type', 'Transaction_Result',
       'Amount US$', 'Individual_Price_US$', 'Quantity', 'output'

In [52]:
features = ['Product', 'Gender',
       'Device_Type', 'Country', 'State', 'City', 'Category',
       'Customer_Login_type', 'Delivery_Type', 'Transaction_Result',
       'Amount US$', 'Individual_Price_US$', 'Quantity', 'output']

In [53]:
dataset = data_cleaned[features]

In [54]:
dataset = dataset.dropna()

Let's check how many samples exist in each class after removing rows with NaN values

In [55]:
print(f"Number of samples: {len(dataset)}")
print(f"Number of transactions followed by another within 2 months: {dataset['output'].sum()}")

Number of samples: 65375
Number of transactions followed by another within 2 months: 506


### Data Preparation for model training - Feature Engineering ###

Some machine learning models are sensitive to the magnitude of the features. That is why the 'Amount US$', 'Individual_Price_US$' and 'Quantity' columns are rescaled using the StandardScaler class of scikit-learn.

In [56]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
columns_to_scale = ['Amount US$', 'Individual_Price_US$', 'Quantity']

dataset[columns_to_scale] = scaler.fit_transform(dataset[columns_to_scale])
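
The cell above fits the scaler on the full dataset, before the train/test split. A common alternative is to learn the scaling parameters from the training rows only, so that no statistics from the test rows leak into preprocessing. A minimal sketch under that assumption, on synthetic data (`demo` and its values are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric columns standing in for the real ones
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Amount US$": rng.uniform(0, 7000, size=200),
    "Quantity": rng.integers(10, 25, size=200).astype(float),
})
demo_train, demo_test = train_test_split(demo, test_size=0.33, random_state=42)

# Learn mean and standard deviation from the training rows only
demo_scaler = StandardScaler().fit(demo_train)

# Apply the same parameters to both splits
train_scaled = demo_scaler.transform(demo_train)
test_scaled = demo_scaler.transform(demo_test)

print(train_scaled.mean(axis=0))  # close to 0 by construction
print(test_scaled.mean(axis=0))   # near 0, but generally not exactly 0
```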

Input and output data for model training:

In [57]:
x = dataset[['Product', 'Gender',
       'Device_Type', 'Country', 'State', 'City', 'Category',
       'Customer_Login_type', 'Delivery_Type', 'Transaction_Result',
       'Amount US$', 'Individual_Price_US$', 'Quantity']]
y = dataset['output']

Some columns in the dataset contain categorical variables; the built-in one-hot encoding function of pandas, get_dummies, is used to convert them.

In [58]:
x = pd.get_dummies(data=x)

In [79]:
x.columns

Index(['Transaction_Result', 'Amount US$', 'Individual_Price_US$', 'Quantity',
       'Product_Bag', 'Product_Books', 'Product_Cycle',
       'Product_Fairness Cream', 'Product_Hair Band', 'Product_Jean',
       'Product_Pen Drive', 'Product_Shirt', 'Product_Shoes',
       'Product_spectacles', 'Product_vessels', 'Product_watNew York Citys',
       'Gender_Female', 'Gender_Male', 'Device_Type_Mobile', 'Device_Type_Web',
       'Country_United States', 'State_California', 'State_New York',
       'State_Washington', 'City_Los Angles', 'City_New York City',
       'City_Seattle', 'Category_Accessories', 'Category_Clothing',
       'Category_Electronics', 'Category_Fashion', 'Category_House hold',
       'Category_Vehicle', 'Category_stationaries', 'Category_wearables',
       'Customer_Login_type_First SignUp', 'Customer_Login_type_Guest',
       'Customer_Login_type_Member', 'Customer_Login_type_New ',
       'Delivery_Type_Normal Delivery', 'Delivery_Type_one-day deliver'],
      dtype

Train and test split of datasets:

In [59]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.33, random_state=42)
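
With so few positive samples, a plain random split can leave the two splits with noticeably different class ratios. One option, not used in the cell above, is the `stratify` parameter of `train_test_split`. A sketch on hypothetical labels with a similar imbalance:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels with roughly the dataset's imbalance (~0.8% positives)
labels = np.array([1] * 8 + [0] * 992)
features_demo = np.arange(len(labels)).reshape(-1, 1)

# stratify=labels keeps the class ratio approximately equal in both splits
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    features_demo, labels, test_size=0.33, random_state=42, stratify=labels
)

print(f"Positive share in train: {ys_train.mean():.4f}")
print(f"Positive share in test:  {ys_test.mean():.4f}")
```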

### Training of first model - Random Forest Classifier with unbalanced dataset ###

In [60]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
target_names = ['class 0', 'class 1']
classification_report_output = classification_report(y_test, y_pred, target_names=target_names)


In [61]:
print(f"Accuracy of Random Forest classifier: {accuracy}")

Accuracy of Random Forest classifier: 0.9910540465374988


In [62]:
print(classification_report_output)

              precision    recall  f1-score   support

     class 0       0.99      1.00      1.00     21389
     class 1       0.39      0.08      0.13       185

    accuracy                           0.99     21574
   macro avg       0.69      0.54      0.56     21574
weighted avg       0.99      0.99      0.99     21574



### Balancing dataset by resampling ###

As our dataset is imbalanced, the metrics of the trained model can be misleading. Random oversampling of the minority class is used to balance the classes.

In [63]:
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)

X_resampled, y_resampled = ros.fit_resample(x, y)

In [64]:
print(f"Number of samples: {len(X_resampled)}")
print(f"Number of transactions followed by another within 2 months: {y_resampled.sum()}")

Number of samples: 129738
Number of transactions followed by another within 2 months: 64869


Even though the classes are now balanced, this approach can have some issues. The model might overfit to the minority class, especially if the oversampled data consists of repeated samples that are not diverse enough.
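
In this notebook the oversampling is applied to the full dataset and the train/test split follows, so copies of the same minority row can land in both splits. A common way to avoid that is to split first and oversample only the training portion. The sketch below illustrates the idea on synthetic data; it uses `sklearn.utils.resample` as a stand-in for `RandomOverSampler`, and all names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical imbalanced data: 20 positives among 1000 rows
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 20 + [0] * 980)

# Split first, so the test set contains only original, unseen rows
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

# Oversample the minority class inside the training split only
minority = X_tr[y_tr == 1]
majority = X_tr[y_tr == 0]
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

X_tr_balanced = np.vstack([majority, minority_upsampled])
y_tr_balanced = np.concatenate([
    np.zeros(len(majority), dtype=int),
    np.ones(len(minority_upsampled), dtype=int),
])

print(f"Balanced training set: {np.bincount(y_tr_balanced)}")
print(f"Untouched test set:    {np.bincount(y_te)}")
```

Metrics computed on the untouched test set then reflect the original class distribution rather than the resampled one.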

### Training of first model - Random Forest Classifier with resampled dataset ###

Random Forest classifier with resampled dataset:

In [65]:
X_train, X_test, y_train, y_test = train_test_split(
    X_resampled, y_resampled, test_size=0.33, random_state=42)

model_resampled = RandomForestClassifier(random_state=42)
model_resampled.fit(X_train, y_train)

y_pred_resampled = model_resampled.predict(X_test)

accuracy_resampled = accuracy_score(y_test, y_pred_resampled)
classification_report_resampled = classification_report(y_test, y_pred_resampled, target_names=target_names)

print(f"Accuracy of Random Forest classifier with resampled dataset: {accuracy_resampled}")
print(classification_report_resampled)

Accuracy of Random Forest classifier with resampled dataset: 0.9823655813518942
              precision    recall  f1-score   support

     class 0       1.00      0.97      0.98     21475
     class 1       0.97      1.00      0.98     21339

    accuracy                           0.98     42814
   macro avg       0.98      0.98      0.98     42814
weighted avg       0.98      0.98      0.98     42814



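The accuracy, precision, recall and f1-score are all quite high. However, they are potentially misleading because of the imbalance in the original dataset: the oversampling was applied before the train/test split, so duplicated minority-class samples can appear in both the training and the test set.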

### Training of second model - Support Vector Machine with resampled dataset ###

In [66]:
from sklearn.svm import SVC

svm = SVC(kernel="rbf", gamma=0.5, C=1.0)

svm.fit(X_train, y_train)

y_pred = svm.predict(X_test)

accuracy_resampled = accuracy_score(y_test, y_pred)
classification_report_resampled = classification_report(y_test, y_pred, target_names=target_names)

print(f"Accuracy of Support Vector Machine with resampled dataset {accuracy_resampled}")
print(classification_report_resampled)

Accuracy of Support Vector Machine with resampled dataset 0.7848600924931097
              precision    recall  f1-score   support

     class 0       0.84      0.71      0.77     21475
     class 1       0.75      0.86      0.80     21339

    accuracy                           0.78     42814
   macro avg       0.79      0.79      0.78     42814
weighted avg       0.79      0.78      0.78     42814



The second model, SVM, also performed reasonably on the test dataset under the same metrics, though clearly below the Random Forest (accuracy 0.78 versus 0.98).

### Saving model files for deployment ###

It is necessary to save the preprocessing artifacts alongside the model file, because the same preprocessing steps must be repeated at inference time.

In [85]:
import joblib

Saving the encoded column names for use in deployment.

In [86]:
encoded_columns = x.columns

joblib.dump(encoded_columns, 'encoded_columns.pkl')

['encoded_columns.pkl']

As we scaled some column values, it is necessary to save scaling parameters for deployment

In [83]:
joblib.dump(scaler, 'scaler.pkl')

['scaler.pkl']

Saving SVM model file.

In [84]:
joblib.dump(svm, 'model.pkl')

['model.pkl']
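
At inference time the saved scaler and column list have to be applied in the same order as during training. The following is a sketch of how that could look, not the deployed code: `prepare_for_inference` is a hypothetical helper, and the toy scaler and column index stand in for the objects that would be loaded with `joblib.load`.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def prepare_for_inference(raw, scaler, encoded_columns, columns_to_scale):
    """Hypothetical helper: repeat the training-time preprocessing on new rows."""
    prepared = raw.copy()
    # Reuse the fitted scaler; never refit it on inference data
    prepared[columns_to_scale] = scaler.transform(prepared[columns_to_scale])
    # One-hot encode, then align to the training columns:
    # categories missing from the new data become all-zero columns,
    # categories unseen during training are dropped
    prepared = pd.get_dummies(prepared)
    return prepared.reindex(columns=encoded_columns, fill_value=0)

# Toy stand-ins for the objects that would be loaded from the .pkl files
train = pd.DataFrame({
    "Quantity": [12.0, 17.0, 23.0],
    "Gender": ["Female", "Male", "Female"],
})
toy_scaler = StandardScaler().fit(train[["Quantity"]])
toy_columns = pd.get_dummies(train).columns

new_row = pd.DataFrame({"Quantity": [19.0], "Gender": ["Male"]})
result = prepare_for_inference(new_row, toy_scaler, toy_columns, ["Quantity"])
print(result)
```

The `reindex` step is what makes a single incoming row, which contains only one category per column, match the full set of one-hot columns the model was trained on.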

### Data insights for team ###

Data insights about the 506 transactions that were followed by a next transaction within 2 months:

In [67]:
users_2m = data_cleaned[data_cleaned['output']==1]

In [68]:
users_2m['Device_Type'].value_counts()

Unnamed: 0_level_0,count
Device_Type,Unnamed: 1_level_1
Web,400
Mobile,106


In [69]:
users_2m['City'].value_counts()

Unnamed: 0_level_0,count
City,Unnamed: 1_level_1
Seattle,361
Los Angles,135
New York City,10


In [70]:
users_2m['Category'].value_counts()

Unnamed: 0_level_0,count
Category,Unnamed: 1_level_1
Fashion,289
Clothing,178
stationaries,18
wearables,16
Electronics,4
House hold,1


From this basic analysis, it seems that most of the customers who made their next transaction within 2 months are web users, mostly from Seattle, who purchased in the Fashion and Clothing categories.

Let's take a look at the demographics of customers who made their next transaction after more than 2 months.

In [71]:
users_more_2months = data_cleaned[data_cleaned['days_difference']>60]

In [72]:
print(f"Number of customers who did transaction after 2 months: {len(users_more_2months)}")

Number of customers who did transaction after 2 months: 26


In [73]:
users_more_2months['Device_Type'].value_counts()

Unnamed: 0_level_0,count
Device_Type,Unnamed: 1_level_1
Web,26


In [74]:
users_more_2months['City'].value_counts()

Unnamed: 0_level_0,count
City,Unnamed: 1_level_1
Seattle,19
Los Angles,7


In [75]:
users_more_2months['Category'].value_counts()

Unnamed: 0_level_0,count
Category,Unnamed: 1_level_1
Clothing,17
Fashion,9


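As can be seen, customers who purchased again within 2 months and those who purchased again after 2 months have a similar profile: mostly web users from Seattle buying in the Fashion and Clothing categories. We can say that they are loyal customers, and the marketing team can focus its campaigns on other demographics.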
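
Because the two groups differ greatly in size (506 versus 26 rows), comparing shares rather than raw counts makes the similarity easier to judge. A small sketch using the category counts printed above:

```python
import pandas as pd

# Category counts copied from the two value_counts() outputs above
within_2m = pd.Series({"Fashion": 289, "Clothing": 178, "stationaries": 18,
                       "wearables": 16, "Electronics": 4, "House hold": 1})
after_2m = pd.Series({"Clothing": 17, "Fashion": 9})

# Convert counts to shares so groups of different size are comparable;
# categories absent from one group get a share of 0
shares = pd.DataFrame({
    "within_2m": within_2m / within_2m.sum(),
    "after_2m": after_2m / after_2m.sum(),
}).fillna(0.0)

print(shares.round(3))
```

Fashion and Clothing together account for 467 of the 506 rows in the first group and for all 26 rows in the second, although a group of 26 rows is small enough that its shares should be read with caution.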