****E-Commerce & Retail B2B Classification Case Study****

***Problem Statement***

Schuster is a multinational retail company dealing in sports goods and accessories. Schuster conducts significant business with hundreds of its vendors, with whom it has credit arrangements. Unfortunately, not all vendors respect credit terms and some of them tend to make payments late. Schuster levies heavy late payment fees, although this procedure is not beneficial to either party in a long-term business relationship. The company has some employees who keep chasing vendors to get the payment on time; this procedure nevertheless also results in non-value-added activities, loss of time and financial impact. Schuster would thus try to understand its customers’ payment behaviour and predict the likelihood of late payments against open invoices.



To understand how to approach this problem using data science, let’s first understand the payment process at Schuster now. Every time a transaction of goods takes place with a vendor, the accounting team raises an invoice and shares it with the vendor. This invoice contains the details of the goods, the invoice value, the creation date and the payment due date based on the credit terms as per the contract. Business with these vendors occurs quite frequently. Hence, there are always multiple invoices associated with each vendor at any given time.

***Goal***

Schuster would like to better understand the customers’ payment behaviour based on their past payment patterns (customer segmentation).
Using historical information, it wants to be able to predict the likelihood of delayed payment against open invoices from its customers.
It wants to use this information so that collectors can prioritise their work in following up with customers beforehand to get the payments on time.
To summarise, as a business analyst, you want to find the answer to these questions:

How can we analyse the customer transactions data to find different payment behaviours?
In which way can you segregate the customers based on their previous payment patterns/behaviours?
Based on the historical data, can you predict the likelihood of delayed payment against open invoices from the customers?
Can you draw any business insights based on your developed model?


Overall, you need to build a model with the primary objective of identifying important predictor attributes that will help the business understand indicators of late payment. You have to recommend the classification model that you would finally deploy for production and explain why you recommend it.



**Data Understanding**

RECEIPT_METHOD	In which method payments have been made

CUSTOMER_NAME	Name of the customer/vendor

CUSTOMER_NUMBER	Customer's unique identity number

RECEIPT_DOC_NO	Reference number of the payment receipt

RECEIPT_DATE	The date in which the payment has been made

CLASS	As the payment against these invoices have already been received so Transaction Class as PMT (short for Payment) assigned

CURRENCY_CODE	Currency used for the payment

Local Amount	Invoice value in local currency

USD Amount	Invoice Value converted to USD

INVOICE_ALLOCATED	Invoice number that has been allocated to a particular vendor

INVOICE_CREATION_DATE	The date on which the invoice was created

DUE_DATE	The date by which the payment was to be made

PAYMENT_TERM	Days given to the vendor/customer for making the payments

INVOICE_CLASS	Three types of Invoice classes - Credit Memo or Credit Note (CM), Debit Memo or Debit Note (DM) or Invoice (INV)

INVOICE_CURRENCY_CODE	Currency code as per the invoice generated

INVOICE_TYPE	Invoice created for physical goods or services (non-goods)

Finally target variable will be derived based on the suggested information "You need to derive it by checking whether the payment receipt date falls within, or after the due date. By doing so, you can create your binary target variable as 1 or 0."

In [1]:
# Suppressing Warnings
import warnings
warnings.filterwarnings('ignore')
# Importing Pandas and NumPy
import pandas as pd, numpy as np

In [2]:
# Importing Received dataset and looking at the first 5 records
train = pd.read_excel("G:/M/Domain Oriented Case Study/E-Commerce & Retail B2B Case Study/Received_Payments_Data.xlsx")
train.head()

FileNotFoundError: [Errno 2] No such file or directory: 'G:/M/Domain Oriented Case Study/E-Commerce & Retail B2B Case Study/Received_Payments_Data.xlsx'

**Summary Statistics and EDA on train data**

In [None]:
##min, max and average value of invoice value in USD
print('minimum invoice value: ',train['USD Amount'].min())
print('maximum invoice value: ',train['USD Amount'].max())
print('average invoice value: ',np.round(train['USD Amount'].mean(),1))

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
train[['INVOICE_CREATION_DATE','DUE_DATE', 'AS_OF_DATE']] = train[['INVOICE_CREATION_DATE','DUE_DATE', 'AS_OF_DATE']].apply(pd.to_datetime)

In [None]:
#0: Delayed
#1: On time
#creating the target variable
train['target'] = np.where(train['INVOICE_CREATION_DATE']>train['DUE_DATE'], 0, 1)

In [None]:
#counts of unique values
train['target'].value_counts()

In [None]:
#average invoice value for delayed customers
print('Average Invoice value for delayed customers: ',np.round(train[train['target']==0]['USD Amount'].mean(),2))

In [None]:
#checking the basic information about the columns
train.info()

In [None]:
#Plotting the distribution of invoice value
train['USD Amount'].plot.hist()

In [None]:
import seaborn as sns
sns.set_theme(style="whitegrid")

In [None]:
sns.countplot(x=train["target"])

In [None]:
c = train.target.value_counts()
p = np.round(train.target.value_counts(normalize=True)*100,2)
pd.concat([c,p], axis=1, keys=['counts', '%'])

Here we can see, approximately 4% of the customers are marked as 'Delayed'
Clearly class imbalance is the isuue and we will deal it in the model building process

In [None]:
#delayed and on-time customer unique value counts with the payment class
sns.countplot(data=train, x="CLASS", hue="target")

In [None]:
#unique value counts of payment terms
train['PAYMENT_TERM'].value_counts()

In [None]:
#unique value count distribution of invoice class
train['INVOICE_CLASS'].value_counts()

In [None]:
#unique value count distribution of invoice type
train['INVOICE_TYPE'].value_counts()

In [None]:
#delayed and on-time customer distribution across the invoice class categories
sns.countplot(data=train, x='INVOICE_CLASS', hue="target")

In [None]:
#delayed and on-time customer distribution across the invoice type categories
sns.countplot(data=train, x='INVOICE_TYPE', hue="target")

In [None]:
#Multivariate analysis
sns.barplot(data=train, x="INVOICE_TYPE", y="USD Amount", hue="target")

invoice amount showing pretty high for delayed payment customers in Goods invoice type

In [None]:
#Multivariate analysis
sns.barplot(data=train, x="INVOICE_CLASS", y="USD Amount", hue="target")

credit card payment mode accounts highest invoice amount across all the invoice classes for on-time customers 

In [None]:
#distribution of the invoice amount(USD)
train['USD Amount'].plot(kind='hist')

In [None]:
#variable transformation
#method: cube root 
train['cbrt_USD_Amount'] = np.cbrt(train['USD Amount'])

In [None]:
train['cbrt_USD_Amount'].plot(kind='hist')

In [None]:
#The age is calculated in days by taking the difference between Transaction Date and Due Date
train['age']=(train['INVOICE_CREATION_DATE']-train['DUE_DATE']).dt.days

In [None]:
train[train['target']==1].sample(5)

**Clustering - Customer Segmentation**

**Recommendation Given:** Customer-level attributes could also be important independent variables to be included in the model.
A customer-level attribute can be determined via customer segmentation. You have to segment your customers based on
two derived variables: the average payment time in days for a customer and the standard deviation for the payment time.
Using clustering techniques would result in a few distinct clusters of customers, which can be used as an input variable
for the ML model.

In [None]:
clustering_data = train[['Customer Type','target','Local Amount','age','INVOICE_CLASS','INVOICE_CREATION_DATE']]

In [None]:
clustering_data = clustering_data.applymap(lambda s: s.lower() if type(s) == str else s)

In [None]:
clustering_data.columns= clustering_data.columns.str.lower()

In [None]:
clustering_data['std'] = clustering_data[['age','local amount']].std(axis=1)

In [None]:
clustering_data['target'] = clustering_data['target'].astype(str)

In [None]:
clustering_data = clustering_data.loc[clustering_data['age']<=1200]

In [None]:
median = clustering_data["age"].median()
clustering_data["age"] = np.where(clustering_data["age"] >400, median,clustering_data['age'])

In [None]:
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype

In [None]:
# populate list of numerical and categorical variables
num_list = []
cat_list = []

for column in clustering_data:
    if is_numeric_dtype(clustering_data[column]):
        num_list.append(column)
    elif is_string_dtype(clustering_data[column]):
        cat_list.append(column)
        

print("numeric:", num_list)
print("categorical:", cat_list)

In [None]:
for column in clustering_data:
    plt.figure(column, figsize = (5,5))
    plt.title(column)
    if is_numeric_dtype(clustering_data[column]):
        clustering_data[column].plot(kind = 'hist')
    elif is_string_dtype(clustering_data[column]):
        # show only the TOP 10 value count in each categorical data
        clustering_data[column].value_counts()[:10].plot(kind = 'bar')

In [None]:
# encoding categorical variable
from sklearn.preprocessing import LabelEncoder

clustering_data['customer type'] = LabelEncoder().fit_transform(clustering_data["customer type"])
clustering_data['invoice_class'] = LabelEncoder().fit_transform(clustering_data["invoice_class"])


In [None]:
clustering_data.columns

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler


def data_scaler(scaler, var):
    scaled_var = "scaled_" + var
    model = scaler.fit(df[var].values.reshape(-1,1))
    df[scaled_var] = model.transform(df[var].values.reshape(-1, 1))
    
    plt.figure(figsize = (5,5))
    plt.title(scaled_var)
    df[scaled_var].plot(kind = 'hist')
    
    plt.figure(figsize = (5,5))
    plt.title(var)
    df[var].plot(kind = 'hist')

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

for var in ["age", "local amount"]:
    scaled_var = "scaled_" + var
    model = scaler.fit(clustering_data[var].values.reshape(-1,1))
    clustering_data[scaled_var] = model.transform(clustering_data[var].values.reshape(-1, 1))

In [None]:
plt.figure(figsize = (5,5))
plt.title('scaled_age')
clustering_data['scaled_age'].plot(kind = 'hist')

In [None]:
plt.figure(figsize = (5,5))
plt.title('scaled_local amount')
clustering_data['scaled_local amount'].plot(kind = 'hist')

In [None]:
import seaborn as sns
columns = ['scaled_age','scaled_local amount']
#plt.figure(figsize = (10,20))
g = sns.pairplot(clustering_data[columns])
g.fig.set_size_inches(15,5)

In [None]:
# Load packages
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt 
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
sns.set_style('darkgrid')

In [None]:
X = np.array(clustering_data.loc[:,['std',                # Choose the variable names
                       'age']])    \
                        .reshape(-1, 2)

In [None]:
# Determine optimal cluster number with elbow method
wcss = []

In [None]:
for i in range(1, 11):
    model = KMeans(n_clusters = i,     
                    init = 'k-means++',                 # Initialization method for kmeans
                    max_iter = 300,                     # Maximum number of iterations 
                    n_init = 10,                        # Choose how often algorithm will run with different centroid 
                    random_state = 0)                   # Choose random state for reproducibility
    model.fit(X)                              
    wcss.append(model.inertia_)

In [None]:
# Show Elbow plot
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')                               # Set plot title
plt.xlabel('Number of clusters')                        # Set x axis name
plt.ylabel('Within Cluster Sum of Squares (WCSS)')      # Set y axis name
plt.show()

In [None]:
kmeans = KMeans(n_clusters = 3,                 # Set amount of clusters
                init = 'k-means++',             # Initialization method for kmeans
                max_iter = 300,                 # Maximum number of iterations
                n_init = 10,                    # Choose how often algorithm will run with different centroid
                random_state = 0)               # Choose random state for reproducibility

pred_y = kmeans.fit_predict(X)

In [None]:
# Plot the data
plt.scatter(X[:,0], 
            X[:,1])

# Plot the clusters 
plt.scatter(kmeans.cluster_centers_[:, 0], 
            kmeans.cluster_centers_[:, 1], 
            s=1000,                             # Set centroid size
            c='red',
           alpha=0.5)                           # Set centroid color
plt.show()

We can see that average days of the payment time are segmented in three main zones: 0-1 standard deviation of payment time, 2 standard deviation of payment time and 4 standard deviation of payment time

**Data Preparation**

In [None]:
#dropping the date columns as the necessary information has been derived before
train = train.drop(['RECEIPT_DATE','AS_OF_DATE','DUE_DATE','INVOICE_CREATION_DATE'],axis=1)

In [None]:
train.head()

In [None]:
#making all lower-case
train = train.applymap(lambda s: s.lower() if type(s) == str else s)

In [None]:
train.columns

In [None]:
#making the header name lower-case
train.columns= train.columns.str.lower()

In [None]:
train.columns

In [None]:
train.isnull().sum()

In [None]:
numeric_data = train.select_dtypes(include=[np.number])
categorical_data = train.select_dtypes(exclude=[np.number])

In [None]:
#dropping unnecessary numeric columns
numeric_data = numeric_data.drop(['customer_number', 'receipt_doc_no', 'local amount', 'usd amount'],axis=1)

In [None]:
numeric_data.columns

In [None]:
numeric_data.isnull().sum()

In [None]:
categorical_data.columns

In [None]:
#dropping unnecessary categorical columns
categorical_data = categorical_data.drop(['receipt_method', 'customer_name','invoice_allocated', 'currency_code', 'class'],axis=1)

In [None]:
categorical_data.columns

In [None]:
#dummy encoding
encoded_cols = pd.get_dummies(train[categorical_data.columns], drop_first=True)

In [None]:
data = pd.concat([numeric_data,encoded_cols], axis=1)

In [None]:
data.shape

In [None]:
data.sample(5)

In [None]:
data.shape

**train-test split**

In [None]:
from sklearn.model_selection import train_test_split
X = data.drop(['target'],axis=1)
y = data[['target']]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, test_size = 0.3, random_state = 100)

In [None]:
#apply standard scaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [None]:
num_vars = ['cbrt_usd_amount','age']

In [None]:
data[num_vars] = scaler.fit_transform(data[num_vars])

In [None]:
scaler.fit(X_train[['cbrt_usd_amount','age']])
X_train[['cbrt_usd_amount','age']] = scaler.transform(X_train[['cbrt_usd_amount','age']])

In [None]:
X_test[['cbrt_usd_amount','age']] = scaler.transform(X_test[['cbrt_usd_amount','age']])

**model building**

In [None]:
import statsmodels.api as sm
# Logistic regression model
logm1 = sm.GLM(y_train,(sm.add_constant(X_train)), family = sm.families.Binomial())
logm1.fit().summary()

In [None]:
from sklearn.linear_model import LogisticRegression
logisticRegr = LogisticRegression()

In [None]:
logisticRegr.fit(X_train, y_train)

In [None]:
# Use score method to get accuracy of model
prediction = logisticRegr.predict(X_train)
test_score = logisticRegr.score(X_test, y_test)
test_score

In [None]:
print('X_train shape: ',X_train.shape)
print('X_test shape: ',X_test.shape)

In [None]:
train_pred = logisticRegr.predict(X_train)
y_pred = logisticRegr.predict(X_test)

In [None]:
from sklearn import metrics
train_score = metrics.accuracy_score(y_train, train_pred)
test_score = metrics.accuracy_score(y_test, y_pred)
print("train score", train_score)
print("test score", test_score)

In [None]:
from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import confusion_matrix, plot_confusion_matrix

#conf_mat = confusion_matrix(y_test, y_pred)
#class_names=[0,1]
#print(conf_mat)
#plt.figure(figsize=(10,5))

#plt.grid(False)
plt.figure(figsize=(10,5))
plot_confusion_matrix(logisticRegr, X_test, y_test, cmap="summer",
                                colorbar=True)  
plt.grid(False)
plt.show()

In [None]:
from sklearn import metrics
from sklearn.metrics import confusion_matrix
# predict probabilities on Test and take probability for class 1([:1])
y_pred_prob_test = logisticRegr.predict_proba(X_test)[:, 1]
#predict labels on test dataset
y_pred_test = logisticRegr.predict(X_test)
# create onfusion matrix
cm = confusion_matrix(y_test, y_pred_test)
print("confusion Matrix is :\n\n",cm)
print("\n")
# ROC- AUC score
print("ROC-AUC score  test dataset:  \t", metrics.roc_auc_score(y_test,y_pred_prob_test))
#Precision score
print("precision score  test dataset:  ", metrics.precision_score(y_test,y_pred_test))
#Recall Score
print("Recall score  test dataset:  \t", metrics.recall_score(y_test,y_pred_test))
#f1 score
print("f1 score  test dataset :  \t", metrics.f1_score(y_test,y_pred_test))

Challenges related to imbalanced dataset
1. Biased predictions
2. Misleading accuracy

We will check with two efficient techniques: ADASYN and SMOTE+TOMEK

**Dealing with class imbalance techniques**

In [None]:
#ADASYN
from collections import Counter
from imblearn.over_sampling import ADASYN
ada = ADASYN(random_state=45, n_neighbors=5)
X_resampled_ada, y_resampled_ada = ada.fit_resample(X_train, y_train)
len(X_resampled_ada)

In [None]:
print(sorted(Counter(y_resampled_ada).items()))

In [None]:
lreg_ada = LogisticRegression()
lreg_ada.fit(X_resampled_ada, y_resampled_ada)

y_pred_ada = lreg_ada.predict(X_test)

In [None]:
from sklearn import metrics 
print ('Accuracy: ', metrics.accuracy_score(y_test, y_pred_ada))
print ('F1 score: ', metrics.f1_score(y_test, y_pred_ada))
print ('Recall: ', metrics.recall_score(y_test, y_pred_ada))
print ('Precision: ', metrics.precision_score(y_test, y_pred_ada))
print ('\n clasification report:\n', metrics.classification_report(y_test,y_pred_ada))
print ('\n confussion matrix:\n',metrics.confusion_matrix(y_test, y_pred_ada))

ADASYN is an automatic first choice to handle the class imbalance problem as it's an extension of SMOTE where the minority examples are generated based on their density distribution.
More synthetic data are generated from minority class samples that are harder to learn as compared with those minority samples that are easier to learn.

SMOTE+TOMEK Combining Oversampling and Undersampling

1. Tomek links can be used as an under-sampling method or as a data cleaning method.
2. Tomek links to the over-sampled training set as a data cleaning method. Thus, instead of removing only the majority class examples that from Tomek links, examples from both classes are removed.

In [None]:
#SMOTE+TOMEK
from imblearn.combine import SMOTETomek
smt_tmk = SMOTETomek(random_state=45)
X_resampled_smt_tmk, y_resampled_smt_tmk = smt_tmk.fit_resample(X_train, y_train)
len(X_resampled_smt_tmk)

In [None]:
print(sorted(Counter(y_resampled_smt_tmk).items()))

In [None]:
lreg_smt_tmk = LogisticRegression()
lreg_smt_tmk.fit(X_resampled_smt_tmk, y_resampled_smt_tmk)

y_pred_smt_tmk = lreg_smt_tmk.predict(X_test)

In [None]:
print ('Accuracy: ', metrics.accuracy_score(y_test, y_pred_smt_tmk))
print ('F1 score: ', metrics.f1_score(y_test, y_pred_smt_tmk))
print ('Recall: ', metrics.recall_score(y_test, y_pred_smt_tmk))
print ('Precision: ', metrics.precision_score(y_test, y_pred_smt_tmk))
print ('\n clasification report:\n', metrics.classification_report(y_test,y_pred_smt_tmk))
print ('\n confussion matrix:\n',metrics.confusion_matrix(y_test, y_pred_smt_tmk))

So, we will finalize the SMOTE+TOMEK model as it's giving the better result across all the metrics

In [None]:
#Import Random Forest Model
from sklearn.ensemble import RandomForestClassifier

#Create a Gaussian Classifier
clf=RandomForestClassifier(n_estimators=100)

#Train the model using the training sets y_pred=clf.predict(X_test)
clf.fit(X_train,y_train)

y_pred=clf.predict(X_test)

In [None]:
#train and test set prediction
y_pred = clf.predict(X_test)
y_pred_train = clf.predict(X_train)
print("Train set accuracy:",metrics.accuracy_score(y_train, y_pred_train))
print("Test set accuracy:",metrics.accuracy_score(y_test, y_pred))

In [None]:
feature_imp = pd.Series(clf.feature_importances_,index=X_train.columns).sort_values(ascending=False)
feature_imp.nlargest(20)

Top 20 features as per the feature-importance of Random Forest model

In [None]:
#Feature Importance using Recursive Feature Elimination
from sklearn.feature_selection import RFE
predictors = X_resampled_smt_tmk
selector = RFE(lreg_smt_tmk, n_features_to_select=10)
selector = selector.fit(predictors, y_resampled_smt_tmk)

In [None]:
print(selector.support_)
print(selector.ranking_)

In [None]:
order = selector.ranking_
order

In [None]:
X_resampled_smt_tmk.columns[selector.support_]

**Finalize the model**

In [None]:
X = data[['age',
       'payment_term_50% advance payment and 50% upon receiving the shipment',
       'payment_term_eom', 'payment_term_lcsight',
       'payment_term_on consignment', 'invoice_currency_code_eur',
       'invoice_currency_code_gbp', 'invoice_currency_code_kwd',
       'invoice_currency_code_qar', 'invoice_type_non goods']]
y = data[['target']]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, test_size = 0.3, random_state = 100)

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [None]:
num_vars = ['age']

In [None]:
data[num_vars] = scaler.fit_transform(data[num_vars])

In [None]:
scaler.fit(X_train[['age']])
X_train[['age']] = scaler.transform(X_train[['age']])

In [None]:
X_test[['age']] = scaler.transform(X_test[['age']])

In [None]:
#SMOTE+TOMEK
from imblearn.combine import SMOTETomek
smt_tmk = SMOTETomek(random_state=45)
X_resampled_smt_tmk, y_resampled_smt_tmk = smt_tmk.fit_resample(X_train, y_train)
len(X_resampled_smt_tmk)

In [None]:
lreg_smt_tmk = LogisticRegression()
lreg_smt_tmk.fit(X_resampled_smt_tmk, y_resampled_smt_tmk)

y_pred_smt_tmk = lreg_smt_tmk.predict(X_test)

In [None]:
from sklearn import metrics
print ('Accuracy: ', metrics.accuracy_score(y_test, y_pred_smt_tmk))
print ('F1 score: ', metrics.f1_score(y_test, y_pred_smt_tmk))
print ('Recall: ', metrics.recall_score(y_test, y_pred_smt_tmk))
print ('Precision: ', metrics.precision_score(y_test, y_pred_smt_tmk))
print ('\n clasification report:\n', metrics.classification_report(y_test,y_pred_smt_tmk))
print ('\n confussion matrix:\n',metrics.confusion_matrix(y_test, y_pred_smt_tmk))

In [None]:
y_pred_proba = lreg_smt_tmk.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
#plt.figure(figsize=(16,8))
#plt.plot(fpr,tpr,label="AUC-ROC Curve, auc="+str(auc))
#plt.legend(loc=4) 
#plt.show()
plt.subplots(1, figsize=(16,8))
plt.title('Receiver Operating Characteristic - Logistic regression')
plt.plot(fpr, tpr,label="AUC-ROC Curve, auc="+str(auc))
plt.plot([0, 1], ls="--")
plt.plot([0, 0], [1, 0] , c=".7"), plt.plot([1, 1] , c=".7")
plt.ylabel('Sensitivity(True Positive Rate)')
plt.xlabel('1-Specificity(False Positive Rate)')
plt.legend(loc=4)
plt.show()

In [None]:
print("auc-roc score: ", metrics.roc_auc_score(y_test, y_pred_proba))

So, we can observe that all score of the metrics got improved in this finalized model

**Unseen Data: Model Evaluation**

In [None]:
# Importing the dataset and looking at the first 6 records
test = pd.read_csv("G:/M/Domain Oriented Case Study/E-Commerce & Retail B2B Case Study/Open_Invoice_data.csv",encoding='latin1')
test.head()

In [None]:
df = pd.DataFrame()
df['Customer No.'] = test['Customer Account No'].astype(str)

In [None]:
test = test.drop(['AS_OF_DATE','Transaction Number'],axis=1)

In [None]:
test['invoice creation date'] = np.where(test['INV_CREATION_DATE'].str.contains('/'), pd.to_datetime(test['INV_CREATION_DATE']).dt.strftime('%m/%d/%Y'), pd.to_datetime(test['INV_CREATION_DATE'], dayfirst=True).dt.strftime('%m/%d/%Y'))

In [None]:
test.sample(5)

In [None]:
test[['invoice creation date','Due Date']] = test[['invoice creation date','Due Date']].apply(pd.to_datetime)

In [None]:
test['target'] = np.where(test['invoice creation date']>test['Due Date'], 0, 1)

In [None]:
test.target.value_counts()

In [None]:
df['actual'] = test['target']

In [None]:
test.isnull().sum()

In [None]:
test = test.dropna()

In [None]:
test.info()

In [None]:
test['USD Amount'] = test['USD Amount'].str.replace(',', '').astype(float)

In [None]:
test['USD Amount'].sample(5)

In [None]:
test = test.drop(['Customer_Name','Customer Account No','Transaction Date','Due Date','Local Amount','INV_CREATION_DATE','invoice creation date','target'],axis=1)

In [None]:
test.head()

In [None]:
test['Transaction Class'] = test['Transaction Class'].map({'CREDIT NOTE': 'CM', 'DEBIT NOTE':'DM', 'INVOICE':'INV', 'PAYMENT':'CM'})

In [None]:
test.info()

In [None]:
test = test.applymap(lambda s: s.lower() if type(s) == str else s)

In [None]:
test.columns= test.columns.str.lower()

In [None]:
numeric_data = test.select_dtypes(include=[np.number])
categorical_data = test.select_dtypes(exclude=[np.number])

In [None]:
categorical_data.columns

In [None]:
categorical_data.rename(columns = {'payment term':'payment_term','transaction class':'invoice_class','transaction currency':'invoice_currency_code'}, inplace = True)

In [None]:
categorical_data.columns

In [None]:
test.rename(columns = {'payment term':'payment_term','transaction class':'invoice_class','transaction currency':'invoice_currency_code'}, inplace = True)

In [None]:
encoded_cols = pd.get_dummies(test[categorical_data.columns], drop_first=True)

In [None]:
numeric_data.columns

In [None]:
numeric_data.rename(columns = {'usd amount':'cbrt_usd_amount'}, inplace=True)

In [None]:
numeric_data.columns

In [None]:
test_data = pd.concat([numeric_data,encoded_cols], axis=1)

In [None]:
test_data.shape

In [None]:
unseen = test_data[['age',
       'payment_term_50% advance payment and 50% upon receiving the shipment',
       'payment_term_eom', 'payment_term_lcsight',
       'payment_term_on consignment', 'invoice_currency_code_eur',
       'invoice_currency_code_gbp', 'invoice_currency_code_kwd',
       'invoice_currency_code_qar', 'invoice_type_non goods']]

In [None]:
X_Predict = lreg_smt_tmk.predict(unseen)

In [None]:
lreg_smt_tmk.intercept_

In [None]:
lreg_smt_tmk.coef_

In [None]:
lreg_smt_tmk.predict_proba(unseen)

In [None]:
result=pd.DataFrame(data=X_Predict, index=unseen.index, columns=['score'])

In [None]:
result.head(5)

In [None]:
result['predicted_probabilities'] = lreg_smt_tmk.predict_proba(unseen)[:,1]

In [None]:
result['is_delayed'] = np.where(result['predicted_probabilities'] >= 0.7,"yes","no")

In [None]:
result['Cust id'] = df['Customer No.']
result['Cust id'] = result['Cust id'].astype('str').str.replace(r".0", r"", regex=False)

In [None]:
result['actual'] = df['actual']

In [None]:
result.rename(columns = {'score':'predicted'}, inplace=True)

In [None]:
expected_result = result[['Cust id','actual','predicted','is_delayed']]

In [None]:
expected_result.sample(10)

The above summary table ensures that the probability of late payment is aggregated at a customer level.
For example, we can see for customer id 20187, we predicted that the customer will be delayed with more than 70% probabilities.

In [None]:
expected_result[expected_result['is_delayed']=="yes"]

Finally, we can see that there are 28287 customers out of 88201 unseen records are predicted as delayed customers.

**Top 10 factors / important predictors**

age
payment_term_50% advance payment and 50% upon receiving the shipment
payment_term_eom
payment_term_lcsight
payment_term_on consignment
invoice_currency_code_eur
invoice_currency_code_gbp
invoice_currency_code_kwd
invoice_currency_code_qar
invoice_type_non goods

**Recommendations**

1. We should focus more on the time difference between Due Date and Invoice Payment Date
2. Payment terms: 50% advance payment and 50% upon receiving the shipment, eom, lcsight and on consignment variables need to be considered with greater attention.
3. Where the invoice currency codes are eur, gbp, kwd and qar, the risk is higher of delay payment.
4. Invoice type non-goods has lower impact than Goods invoice type in delayed payment.