### Buyer vs Supplier address classification
- If we see the data, buyer and supplier both are the basically the address text only. 
- from a text feature point its very hard to tell which is buyer and supplier address, as they are essentially the same, except for in some cases where we might have few keyphrases to hel pthe classification.
- as such, maybe the only thing that might be very useful is the position information of these two.
- in this notebook, I will attempt to use the position info and see if we can get somewhere with it. (although, this might require a lot more samples to generalize well)


In [10]:
from src import DataStats
from pprint import pprint

from src import BoxBasedTagger
from src import check_repeated_data
from src.utility import jsonl_reader
from src.utility import inspection_full_matching
from src.utility import inspection_partial_matching

from sklearn.utils import shuffle
from src import FeatureGeneration
from src.trainer import train_model
from sklearn.model_selection import train_test_split as tts
from src.utility import print_classifaction_report, save_model



In [11]:
dataset = jsonl_reader('dataset/sample_dataset.jsonl')

In [12]:
#creating a weak supervised dataset based on boundbox
tagger = BoxBasedTagger(dataset, thresold=0.30)
supervised_data = tagger.result
supervised_data.groupby('target').nunique()

Unnamed: 0_level_0,text,x1,y1,x2,y2,target,line_id,doc_idx
target,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
both,6,7,7,7,7,1,7,3
buyer-address,166,163,172,173,172,1,13,78
non-addressline,288,302,307,305,306,1,18,103
vendor-address,136,139,143,143,144,1,9,71


In [13]:
supervised_data = supervised_data.query(f'target != "both"')
supervised_data.groupby('target').nunique()

I0725 21:14:51.235091 140268661499712 utils.py:129] Note: NumExpr detected 12 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
I0725 21:14:51.235593 140268661499712 utils.py:141] NumExpr defaulting to 8 threads.


Unnamed: 0_level_0,text,x1,y1,x2,y2,target,line_id,doc_idx
target,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
buyer-address,166,163,172,173,172,1,13,78
non-addressline,288,302,307,305,306,1,18,103
vendor-address,136,139,143,143,144,1,9,71


In [14]:
imp_columns = ['x1','y1','x2','y2', 'line_id']

In [15]:
def train_test_split(dataset, imp_columns):
    dataset = shuffle(dataset)
    train_data, test_data = tts(dataset,random_state=23)
    X_test , y_test = test_data[imp_columns], test_data.target
    X_train , y_train = train_data[imp_columns], train_data.target
    return X_train, y_train, X_test, y_test, train_data, test_data

In [16]:
X_train, y_train, X_test, y_test, train_data, test_data = train_test_split(supervised_data, imp_columns)

In [15]:
rf_model = train_model('random_forest',X_train, y_train)
print_classifaction_report(rf_model, X_test, y_test)
save_model(rf_model, 'models/rfc_address_classifier.pkl')

[[32  8  0]
 [16 41 14]
 [ 1 18 20]]
                 precision    recall  f1-score   support

  buyer-address       0.65      0.80      0.72        40
non-addressline       0.61      0.58      0.59        71
 vendor-address       0.59      0.51      0.55        39

       accuracy                           0.62       150
      macro avg       0.62      0.63      0.62       150
   weighted avg       0.62      0.62      0.62       150





In [14]:
svc_model = train_model('svm',X_train, y_train)
print_classifaction_report(rf_model, X_test, y_test)
save_model(svc_model, 'models/svm_address_classifier.pkl')

[[34  5  1]
 [20 39 12]
 [ 6 17 16]]
                 precision    recall  f1-score   support

  buyer-address       0.57      0.85      0.68        40
non-addressline       0.64      0.55      0.59        71
 vendor-address       0.55      0.41      0.47        39

       accuracy                           0.59       150
      macro avg       0.59      0.60      0.58       150
   weighted avg       0.60      0.59      0.58       150





In [31]:
test_data['rfc_prediction'] = rfc_model.predict(test_data[imp_columns])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [34]:
test_data.query(f'rfc_prediction != target')[['text','line_id', 'target', 'rfc_prediction']].head(10)

Unnamed: 0,text,line_id,target,rfc_prediction
2,Mississippi 38860,3,vendor-address,non-addressline
3,www.nespack.com,8,non-addressline,buyer-address
3,GST Registration No. : 20-05029 1-4R Fox +65 6...,6,non-addressline,buyer-address
1,04742 Tel: 6124.0126 Facsimile: 414-9789 $15 S...,4,vendor-address,non-addressline
1,541 ORCHID ROAD #05-03 (MARINA BAY SANDS) ORIG...,2,buyer-address,vendor-address
8,İTEM DESCRIPTION QTY UNIT PRICE AMOUNT,10,non-addressline,buyer-address
1,"44 Jalans Buraho, TAX INVOICE NO. 1730134SGP",4,vendor-address,non-addressline
3,"WAVA Tower, California 658565 Tel :",4,vendor-address,non-addressline
1,1 WOODLAND LANE DELIVERY TO: NO. : 15392,2,buyer-address,vendor-address
4,www.auratot.com.sg Pages: 1,6,non-addressline,buyer-address


## Observations

1. The main case where the model seems to be failing is wrt partial-address lines.
2. A potential approach to improve this model would be to add sorrounding text features to the address-bounding box.
3. Augmenting this data is a bit challenging, but with a bit more time, it might be possible.
4. There are 2 models above that I have experimented with, overall the RF seems to be better. How ever when you look at the recall, SVM does seem to be better wrt buyer address.
5. overall I think we could improve the results of this classifier further, by adding a bit more data & a bit more feature engineering.
6. Also, the other thing that we can try is if we have a model that can jointly learn(sharing weights) to detect address lines and also classify which address is which it might be quite interesting. (this is based on the observation that both classifiers are trying to model/capture different patterns, especially where we have partial-address lines.)