### Receipt - Transaction Ranking Model

Problem Statement:
To rank the transaction based on the relevance to the uploaded receipt.

Approach: 
Based on the initial analysis, we have only the matched transaction and feature transaction rank attribute vector. But to do list level ranking, we would require recipet-transaction ranking to generate the ranking.

So I have break down the problem into two phases 

Phase 1: Generate the relevance score generation using Tree based algorithms 
Target Variable: Matched (If matched_txn == featured_txn then 1 else 0)
Independent Variable: Similiarity Attribute Vector
Algorithm Tested: XGBoost, Random Forest. 
Evaluation Metrics : AUC 

Phase 2: Generate the relevance rank for the transaction list for a given receipt 
Target Variable: Reciept Level Transaction ranking (Derived from the prob score from Phase I)
Independent Variable: Similiarity Attribute Vector 
Algorithm Tested: LightGBM - lambda rank 
purpose, I have adopted the same.
Evaluation Metrics : nDCG and MRR 

Project Structure : 
1. src - Folder is Parent Folder for the Project 
2. Notebook - Folder has the Basic EDA, Phase 1 and 2 Implementation testing 
3. src/components Folder has following 
    1. data_ingestion.py - It handles the data ingestion. Could be extended to db extraction 
    2. dt_relevance_score.py - It handles the data transformation work for the Phase I Scoring 
    3. dt_ranker.py - It handles the data transformation for the Phase II 
    4. model_trainer_relevance_score - It handles the model training, grid search and best model selection and prediction for relevance Scoring 
    5. model_trainer_trans_ranker.py - It handles the model traning for the ranking the transaction. 
4. artifacts - Folder has 
    1. Model - Relevance Scorer and Ranker 
    2. Preprocessor Pipeline - For both Phases 

5. Pipeline - Folder has the placeholder for training and prediction pipeline
6. Utils Folder has the files related to logging, exception handling 

Explanation for Choice: 
1. Phase I
    -  I have used Bagging and Boosting models along with Grid Search to fine tune the hyper parameter and predict the relevance score for each transaction based on the attribute vector. 

    - I have used the AUC evaluation metrics since it can handle imbalance. Relevance Scoring model has AUC around 0.8 and closer the value to 1 better the peformance. Our model is delivering good performance. The performance can be improved using additional dataset and features. 

2. Phase II

    - I have leveraged the LightGBM based lambdarank algo for listwise ranking. Lamddarank performs really well on the listwise ranking problems. 
    - Reason for Metrics: 
    1. nDCG - Provides a normalized view on the performance of the model. nDCG = 0.99 describes that in relation to original ranking order, model is able to rank more accurately.

    2. MRR - If we could show the first relevant transaction effectively, higher chance for the customer engagement and experience. MRR = 0.75 can proves the model is able to identify top most relevant transaction.

Next Steps: 
1. Improve the Relevance score generator model based additional features and large dataset
2. Add LightGBM wrapper to handle the GridSearch/Hyperparameter tuning 
3. Explore the deep learning based ranking algorithm - ListNet , Listwise ANN
3. Add in additional features improve the ranker model performance 



In [1]:
%load_ext autoreload

In [2]:
%autoreload 2

In [3]:
import os
import sys 

In [4]:
#Setting up the Project directory
print("Current working directory:", os.getcwd())
os.chdir('/Users/yohan/Desktop/GitRepos/Tide/TideWorks')
print("New working directory:", os.getcwd())

Current working directory: /Users/yohan/Desktop/GitRepos/Tide/TideWorks/Notebooks
New working directory: /Users/yohan/Desktop/GitRepos/Tide/TideWorks


In [5]:
from src.components.data_ingestion import DataIngestion
from src.components.dt_relevance_score import DataTransformation
from src.components.model_trainer_relevance_score import ModelTrainer
from src.components.dt_ranker import RankerDataTransformation
from src.components.model_trainer_trans_ranker import RankerModelTrainer

In [6]:
#Variables are set constants, It will be read from config files
SEED = 42
TARGET_COLUMN = 'matched'
CORRELATED_DROP_COLUMNS = ['DifferentPredictedTime','DifferentPredictedDate']
REMOVE_COLUMNS = ['receipt_id','company_id','matched_transaction_id','feature_transaction_id']
CAT_COLUMNS = ['DifferentPredictedTime','TimeMappingMatch','ShortNameMatch''DifferentPredictedDate','PredictedTimeCloseMatch']
NUM_COLUMNS = ['DateMappingMatch', 'AmountMappingMatch', 'DescriptionMatch', 'PredictedNameMatch', 'PredictedAmountMatch']

In [7]:
#Initializing the data ingestion
di = DataIngestion()
train_path,test_path = di.runDataIngestion()

In [8]:
#Initiating the data transformation for scoring
dt = DataTransformation(TARGET_COLUMN
,CORRELATED_DROP_COLUMNS
,REMOVE_COLUMNS
,CAT_COLUMNS
,NUM_COLUMNS
,train_path
,test_path
)
train_arr,test_arr,prePipePath = dt.runPreprocessor()
    
    

In [9]:
#Generating the transaction relevance score
mt = ModelTrainer(train_arr,test_arr,train_path,test_path)
aucScore,ranker_train_path,ranker_test_path = mt.getBestModel()
print(aucScore)
print(ranker_train_path)

0.8121999699969996
artifacts/train_with_score.pkl


In [10]:
#Initiating the data transformation for ranking
dtr = RankerDataTransformation(
                             CORRELATED_DROP_COLUMNS
                            ,REMOVE_COLUMNS
                            ,CAT_COLUMNS
                            ,NUM_COLUMNS
                            ,ranker_train_path
                            ,ranker_test_path)
    

train_arr,test_arr,qids_train,qids_test,prePipePathRanker = dtr.runPreprocessor()


In [11]:
#Generating the receipt and transaction list level ranking
ranker = RankerModelTrainer(
         train_arr
        ,test_arr
        ,qids_train 
        ,qids_test 
        ,ranker_train_path 
        ,ranker_test_path
    )
    
test_df = ranker.runModelTrainer()

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000991 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 39
[LightGBM] [Info] Number of data points in the train set: 9445, number of used features: 9
NDCG Score for the ranker: 0.9992772653887911
Mean Reciprocal Rank (MRR): 0.7748917748917749


In [12]:
test_df.head()

Unnamed: 0,receipt_id,company_id,matched_transaction_id,feature_transaction_id,DateMappingMatch,AmountMappingMatch,DescriptionMatch,DifferentPredictedTime,TimeMappingMatch,PredictedNameMatch,ShortNameMatch,DifferentPredictedDate,PredictedAmountMatch,PredictedTimeCloseMatch,matched,relevance_score,relevance_rank,pred_rs,pred_relevance_rank
34,10003,10000,10412,10140,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0,0.0,1,-0.42948,1
35,10003,10000,10412,10141,0.55,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,1,-0.42948,1
36,10003,10000,10412,10410,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0,0.0,1,-0.42948,1
37,10003,10000,10412,10411,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0,0.0,1,-0.42948,1
40,10003,10000,10412,10414,0.85,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,1,-0.42948,1


In [13]:
test_df[test_df.receipt_id == '10,003'].sort_values(by = ['relevance_score'],ascending = [False])

Unnamed: 0,receipt_id,company_id,matched_transaction_id,feature_transaction_id,DateMappingMatch,AmountMappingMatch,DescriptionMatch,DifferentPredictedTime,TimeMappingMatch,PredictedNameMatch,ShortNameMatch,DifferentPredictedDate,PredictedAmountMatch,PredictedTimeCloseMatch,matched,relevance_score,relevance_rank,pred_rs,pred_relevance_rank
39,10003,10000,10412,10413,0.85,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0,1.0,2,0.38791,2
38,10003,10000,10412,10412,0.85,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1,1.0,2,0.406053,3
34,10003,10000,10412,10140,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0,0.0,1,-0.42948,1
35,10003,10000,10412,10141,0.55,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,1,-0.42948,1
36,10003,10000,10412,10410,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0,0.0,1,-0.42948,1
37,10003,10000,10412,10411,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0,0.0,1,-0.42948,1
40,10003,10000,10412,10414,0.85,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,1,-0.42948,1
41,10003,10000,10412,10415,0.85,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,1,-0.42948,1


In [14]:
test_df[test_df.receipt_id == '30,015'].sort_values(by = ['relevance_score'],ascending = [False])

Unnamed: 0,receipt_id,company_id,matched_transaction_id,feature_transaction_id,DateMappingMatch,AmountMappingMatch,DescriptionMatch,DifferentPredictedTime,TimeMappingMatch,PredictedNameMatch,ShortNameMatch,DifferentPredictedDate,PredictedAmountMatch,PredictedTimeCloseMatch,matched,relevance_score,relevance_rank,pred_rs,pred_relevance_rank
4883,30015,30000,30831,30831,0.95,0.0,0.0,1.0,0.0,0.8,0.0,0.0,0.0,0.0,1,1.0,2,0.394841,2
4878,30015,30000,30831,30821,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0,0.0,1,-0.42948,1
4879,30015,30000,30831,30822,0.0,0.0,0.0,1.0,0.0,0.4,0.0,1.0,0.0,0.0,0,0.0,1,-0.42948,1
4880,30015,30000,30831,30823,0.0,0.0,0.0,1.0,0.0,0.4,0.0,1.0,0.0,0.0,0,0.0,1,-0.42948,1
4881,30015,30000,30831,30824,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0,0.0,1,-0.42948,1
4882,30015,30000,30831,30829,0.95,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,1,-0.42948,1
