In [1]:
import sys
sys.path.insert(0, '..')

from formasaurus import formtype_features as features
from formasaurus import evaluation
from formasaurus.storage import Storage, load_html

## Available training data

In [2]:
storage = Storage("../formasaurus/data")
storage.check()
storage.print_form_type_counts(simplify=True)

Checking: 100%|##########################| 696/696 [00:02<00:00, 247.70 files/s]
                                          

Status: OK
Annotated HTML forms (simplified classes):

317   search                    (s)
198   login                     (l)
133   registration              (r)
112   other                     (o)
109   contact/comment           (c)
86    password/login recovery   (p)
83    join mailing list         (m)
37    order/add to cart         (b)

Total form count: 1075




## Load training / evaluation data

We must be careful when splitting the dataset into training and evaluation parts: forms from the same domain should end up in the same "bin". There may be several pages from the same domain, and these pages may contain duplicate or similar forms (e.g. a search form on each page). If one such form goes into the training dataset and another into the evaluation dataset, the metrics will be too optimistic and may lead us to choose the wrong features or models. For this reason a plain random split such as train_test_split from scikit-learn shouldn't be used here.

As an approximation, the data below is sorted by domain (this is done in the storage.get_Xy function) and then split into 2 parts. This means that at most one domain, the one we split at, can leak between the training and evaluation parts.
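The idea of keeping each domain on one side of the split can be sketched in plain Python. This is an illustration only, not the code behind storage.get_Xy: the `split_by_domain` helper and the example URLs are made up, and unlike the fixed `TRAIN_SIZE` cut used in this notebook, the sketch moves the cut forward to the next domain boundary so that no domain is split at all.

```python
from urllib.parse import urlsplit


def split_by_domain(urls, labels, train_size):
    """Sort examples by domain, then cut near train_size without splitting a domain."""
    order = sorted(range(len(urls)), key=lambda i: urlsplit(urls[i]).netloc)
    domains = [urlsplit(urls[i]).netloc for i in order]
    cut = min(train_size, len(order))
    # Move the cut forward until it falls on a boundary between two domains.
    while 0 < cut < len(order) and domains[cut] == domains[cut - 1]:
        cut += 1
    train = [(urls[i], labels[i]) for i in order[:cut]]
    test = [(urls[i], labels[i]) for i in order[cut:]]
    return train, test


# Made-up example data: two pages of a.example carry the same search form.
urls = [
    "http://b.example/login",
    "http://a.example/search",
    "http://a.example/search?page=2",
    "http://c.example/contact",
]
labels = ["l", "s", "s", "c"]

train, test = split_by_domain(urls, labels, train_size=1)
print("train:", train)
print("test:", test)
```

With a requested training size of 1, the cut is pushed to 2 so that both a.example pages stay in the training part.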

In [3]:
TRAIN_SIZE = 700

X, y = storage.get_Xy(verbose=True, leave=True)
X_train, X_test, y_train, y_test = X[:TRAIN_SIZE], X[TRAIN_SIZE:], y[:TRAIN_SIZE], y[TRAIN_SIZE:]

print("Train size: %d, test size: %d" % (len(y_train), len(y_test)))

Loading: 696 files [00:02, 247.61 files/s]


Train size: 700, test size: 375





## Ideas for useful features

### Search forms

* a single query field
* a field named "q" or "s"
* "search" in URL
* "search" in submit button text (submit value)
* "search" in form css class or id
* no password field
* method == GET?

### Login forms

* username or email and password
* 2 passwords - likely not a login form
* "login" or "sign in" (or variations) in URL
* "login" or "sign in" (or variations) in form css class or id
* "login" or "sign in" in submit button text
* "Remember me" checkbox (or any single checkbox)
* no select elements
* no textarea elements
* openid?
* method == POST

### Registration forms

* 2 passwords 
* "register" / "sign up" in URL, form css class / id or submit button text
* "agree" checkbox
* email
* username
* method == POST

### Contact forms

* feedback in URL/class
* textarea
* "Send" button
* email
* method == POST

### Password reset

* a single email or username field
* "password" in URL/css class/ submit button text

### Join Mailing List

* a single email field
* subscribe/join/newsletter words
* a short form

The main problem with "join mailing list" forms is distinguishing them from search forms.
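A few of the ideas listed above could be turned into explicit boolean features roughly as follows. This is only a sketch built on the standard library's html.parser; the `FormStats` class, the `handcrafted_features` function and the feature keys are made up for illustration and are not part of formasaurus.

```python
from html.parser import HTMLParser


class FormStats(HTMLParser):
    """Collect a few simple facts about the tags of an HTML form snippet."""

    def __init__(self):
        super().__init__()
        self.method = "get"        # HTML default when the attribute is missing
        self.input_types = []
        self.has_textarea = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input":
            # <input> without a type attribute behaves like type="text"
            self.input_types.append((attrs.get("type") or "text").lower())
        elif tag == "textarea":
            self.has_textarea = True


def handcrafted_features(html):
    """Return a dict of hand-written boolean features for one form."""
    stats = FormStats()
    stats.feed(html)
    n_password = stats.input_types.count("password")
    return {
        "method_is_post": stats.method == "post",
        "no_password": n_password == 0,
        "one_password": n_password == 1,
        "two_passwords": n_password == 2,
        "has_textarea": stats.has_textarea,
        "has_checkbox": "checkbox" in stats.input_types,
    }


login_html = """
<form method="POST" action="/login">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="checkbox" name="remember">
  <input type="submit" value="Sign in">
</form>
"""
feats = handcrafted_features(login_html)
print(feats)
```

A dict like this could be fed to a DictVectorizer, but writing every rule by hand does not scale well.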

## How to handle them

Instead of hardcoding the features above, many of them are generalized. For example, instead of writing "search in URL" we extract all 5-character substrings from the URL and use "`urlsubstring<N>` in URL" as features. This approach has some disadvantages, but it provides a good starting point.

The feature extractors are stored in the formasaurus.formtype_features module.
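To make the substring idea concrete, here is a minimal sketch of extracting all 5-character substrings from a URL. The `char_ngrams` helper and the `urlsubstring=` key format are assumptions made for illustration; in this notebook the actual extraction is delegated to scikit-learn vectorizers with a character analyzer.

```python
def char_ngrams(text, n=5):
    """Return all contiguous n-character substrings of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


url = "/search/results"
url_features = {"urlsubstring=%s" % gram: 1 for gram in char_ngrams(url)}

for name in sorted(url_features):
    print(name)
```

Substrings such as "searc" and "earch" now act as features, so the model can pick up "search" in a URL without that word ever being hardcoded.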

In [4]:
from scipy import stats as st
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline, make_union
from sklearn.linear_model import LogisticRegression, SGDClassifier
# scikit-learn >= 0.18 location; older releases used
# sklearn.grid_search and sklearn.cross_validation instead
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_fscore_support

In [5]:
%%time
# reload(features)
from formasaurus.formtype_model import _create_feature_union
# ======= define the model ========

# features should be kept in sync with formasaurus.formtype_model
# a list of 3-tuples with default features:
# (feature_name, form_transformer, vectorizer)
FEATURES = [
    (
        "bias",
        features.Bias(),
        DictVectorizer(),
    ),
    (
        "form elements",
        features.FormElements(),
        DictVectorizer()
    ),
    (
        "<input type=submit value=...>",
        features.SubmitText(),
        CountVectorizer(ngram_range=(1,2), min_df=1, binary=True)
    ),
    (
        "<a> TEXT </a>",
        features.FormLinksText(),
        TfidfVectorizer(ngram_range=(1,2), min_df=4, binary=True,
                        stop_words={'and', 'or', 'of'})
    ),
    (
        "<label> TEXT </label>",
        features.FormLabelText(),
        TfidfVectorizer(ngram_range=(1,2), min_df=3, binary=True,
                        stop_words="english")
    ),

    (
        "<form action=...>",
        features.FormUrl(),
        TfidfVectorizer(ngram_range=(5,6), min_df=4, binary=True,
                        analyzer="char_wb")
    ),
    (
        "<form class=... id=...>",
        features.FormCss(),
        TfidfVectorizer(ngram_range=(4,5), min_df=3, binary=True,
                        analyzer="char_wb")
    ),
    (
        "<input class=... id=...>",
        features.FormInputCss(),
        TfidfVectorizer(ngram_range=(4,5), min_df=5, binary=True,
                        analyzer="char_wb")
    ),
    (
        "<input name=...>",
        features.FormInputNames(),
        TfidfVectorizer(ngram_range=(5,6), min_df=3, binary=True,
                        analyzer="char_wb")
    ),
    (
        "<input title=...>",
        features.FormInputTitle(),
        TfidfVectorizer(ngram_range=(5,6), min_df=3, binary=True,
                        analyzer="char_wb")
    ),
]


# clf = SGDClassifier(
#     penalty='elasticnet', 
#     loss='log', 
#     alpha=0.0002,
#     fit_intercept=False, 
#     shuffle=True, 
#     random_state=0,
#     n_iter=50,
# )
clf = LogisticRegression(penalty='l2', C=5, fit_intercept=False, random_state=0, tol=0.01)

# clf = LinearSVC(C=0.5, random_state=0, fit_intercept=False)
model = Pipeline([
    ('fe', _create_feature_union(FEATURES)),
    ('clf', clf),
])

evaluation.print_metrics(model, X, y, X_train, X_test, y_train, y_test, ipython=True)


Classification report (700 training examples, 375 testing examples):

             precision    recall  f1-score   support

          b       0.75      0.43      0.55         7
          c       0.75      0.94      0.83        32
          l       0.97      0.90      0.94        73
          m       0.82      0.88      0.85        32
          o       0.66      0.62      0.64        47
          p       0.93      0.81      0.86        31
          r       0.95      0.87      0.91        47
          s       0.90      0.98      0.94       106

avg / total       0.87      0.87      0.87       375

Active features: 48968 out of possible 48968

Confusion matrix (rows=>true values, columns=>predicted values):


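| true label | b | c | l | m | o | p | r | s |
|---|---|---|---|---|---|---|---|---|
| b | 3 | 1 | 0 | 0 | 3 | 0 | 0 | 0 |
| c | 0 | 30 | 0 | 0 | 2 | 0 | 0 | 0 |
| l | 0 | 0 | 66 | 1 | 6 | 0 | 0 | 0 |
| m | 0 | 2 | 0 | 28 | 0 | 0 | 1 | 1 |
| o | 1 | 7 | 0 | 0 | 29 | 1 | 0 | 9 |
| p | 0 | 0 | 0 | 4 | 0 | 25 | 1 | 1 |
| r | 0 | 0 | 2 | 1 | 2 | 1 | 41 | 0 |
| s | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 104 |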



Running cross validation...
10-fold cross-validation F1: 0.886 (±0.058)  min=0.836  max=0.924
CPU times: user 11.9 s, sys: 67.4 ms, total: 12 s
Wall time: 12.1 s
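As a cross-check of the output above, per-class precision and recall can be recomputed from the confusion matrix: precision divides the diagonal cell by its column sum, recall divides it by its row sum. The matrix below is copied from the output of this cell; the `per_class_precision_recall` helper is a name made up for this sketch.

```python
labels = ["b", "c", "l", "m", "o", "p", "r", "s"]

# rows = true values, columns = predicted values (copied from the output above)
matrix = [
    [3, 1, 0, 0, 3, 0, 0, 0],
    [0, 30, 0, 0, 2, 0, 0, 0],
    [0, 0, 66, 1, 6, 0, 0, 0],
    [0, 2, 0, 28, 0, 0, 1, 1],
    [1, 7, 0, 0, 29, 1, 0, 9],
    [0, 0, 0, 4, 0, 25, 1, 1],
    [0, 0, 2, 1, 2, 1, 41, 0],
    [0, 0, 0, 0, 2, 0, 0, 104],
]


def per_class_precision_recall(matrix, labels):
    """Return {label: (precision, recall)} computed from a confusion matrix."""
    result = {}
    for i, label in enumerate(labels):
        true_positives = matrix[i][i]
        predicted = sum(row[i] for row in matrix)   # column sum
        actual = sum(matrix[i])                     # row sum
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / actual if actual else 0.0
        result[label] = (precision, recall)
    return result


metrics = per_class_precision_recall(matrix, labels)
for label in labels:
    print("%s  precision=%.2f  recall=%.2f" % ((label,) + metrics[label]))
```

For example, class "b" was predicted 4 times and 3 of those were right (precision 0.75), while only 3 of the 7 true "b" forms were found (recall 0.43), which agrees with the classification report.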




## Check what classifier learned

For linear classifiers such as Logistic Regression or an SVM without a kernel, we can inspect the coefficient values to better understand how decisions are made.

When features are correlated, the weight is spread across all of them, so looking at individual coefficients does not tell the whole story. It is still a useful check.
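The inspection itself amounts to sorting features by the magnitude of their weights. The sketch below shows the idea on made-up feature names and weights (they are illustrative only, not values from the trained model), and the `top_features` helper is hypothetical rather than part of the evaluation module.

```python
def top_features(feature_names, weights, k=3):
    """Return the k (weight, name) pairs with the largest absolute weight."""
    pairs = sorted(zip(weights, feature_names),
                   key=lambda pair: abs(pair[0]), reverse=True)
    return pairs[:k]


# Made-up weights for one class of a linear model.
names = ["name:qty", "name:order", "name: order", "url:prod", "bias"]
weights = [2.5, 1.05, 1.05, 1.9, -3.0]

for weight, name in top_features(names, weights, k=3):
    print("%+.4f  %s" % (weight, name))

# Correlated features share the weight: the two "order" variants always fire
# together here, so their joint contribution is the sum of both weights.
order_total = sum(w for w, n in zip(weights, names) if "order" in n)
print("combined weight of 'order' features: %.2f" % order_total)
```

Neither "order" feature makes the top 3 on its own, yet together they outweigh "url:prod", which is why a ranked list of single coefficients can understate a signal.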

In [6]:
evaluation.print_informative_features(FEATURES, clf, 25)

b
+2.5555:               <input name=...>   qty 
+1.8767:              <form action=...>   prod
+1.8762:                  <a> TEXT </a>  email
+1.2177:               <input name=...>  oncode
+1.2177:               <input name=...>  oncod
+1.1715:               <input name=...>  ncode 
+1.1715:               <input name=...>  ncode
+1.1526:       <input class=... id=...>   qty
+1.1463:  <input type=submit value=...>  заказ
+1.1463:  <input type=submit value=...>  оплатить заказ
+1.1463:  <input type=submit value=...>  оплатить
+1.1309:                  form elements  has <select>
+1.0815:              <form action=...>  .asp#
+1.0815:              <form action=...>  asp# 
+1.0815:              <form action=...>  .asp# 
+1.0750:               <input name=...>  quant
+1.0679:              <input title=...>   add 
+1.0518:               <input name=...>   order
+1.0518:               <input name=...>  order
+1.0518:               <input name=...>   orde
+1.0479:  <input type=submit value=.

## Compare results with "loginform" library

It is not possible to compare the results with "loginform" library directly because loginform

* always tries to return a login form even if the score is low;
* only detects login forms;
* in case of several forms returns a single form with the best score instead of deciding for each form whether to return it or not.

So we used two approaches:

1. Use `loginform._form_score` with different thresholds; assume that `loginform` detected a login form if the score is greater than or equal to the threshold.
2. Train the same model, but using features from the loginform library (the weights are learned instead of being hardcoded as 'score' increments/decrements).



### 1. loginform scores + thresholds

* **score >= -100** means "simply treat all forms as login forms".

* **score >= 0**: most login forms are captured (recall is about 0.92), but there are many false positives. 
  It is only slightly better than treating all forms as login forms.

* **score >= 10**: the F1 score is the best among all thresholds, 
  but the quality is significantly worse than the F1 of ML-based models.
  
* **score >= 20**: ~84% of detected login forms are correct, but most 
  login forms are not detected. Also, this ~84% precision is still lower than what ML-based models give us.
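For reference, this is how precision, recall and F1 follow from a list of binary predictions. The `binary_metrics` helper is a plain-Python stand-in written for this sketch (the notebook itself uses scikit-learn's precision_recall_fscore_support). The class counts come from the test set used here: 73 login forms out of 375 forms.

```python
def binary_metrics(y_true, y_pred):
    """Return (precision, recall, f1) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# "Treat every form as a login form" on a test set with 73 login forms
# and 302 other forms.
y_true = [True] * 73 + [False] * 302
y_pred = [True] * 375

precision, recall, f1 = binary_metrics(y_true, y_pred)
print("precision = %0.3f    recall = %0.3f    F1 = %0.3f" % (precision, recall, f1))
```

Precision is 73/375 and recall is 1, which reproduces the `score >= -100` baseline: every threshold has to beat an F1 of roughly 0.33 to be of any use.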


In [7]:
%%time 
import loginform


def labels_to_binary(y):
    """ Convert labels to 2-classes: login forms and non-login forms """
    return [tp == 'l' for tp in y]

    
def predict_loginform(X, threshold):
    """
    Return if forms are login or not using loginform
    library scores and a threshold.
    """
    return [
        (loginform._form_score(form) >= threshold)
        for form in X
    ]


def print_threshold_metrics(X_test, y_test, threshold):
    y_test = labels_to_binary(y_test)
    y_pred = predict_loginform(X_test, threshold)

    precision, recall, f1, support = precision_recall_fscore_support(y_test, y_pred, pos_label=True)
    print(
        "score >= %4d:    precision = %0.3f    recall = %0.3f    F1 = %0.3f" % (
        threshold, precision[1], recall[1], f1[1]
    ))


for threshold in [-100, -10, 0, 10, 20, 30]:
    print_threshold_metrics(X_test, y_test, threshold)

score >= -100:    precision = 0.195    recall = 1.000    F1 = 0.326
score >=  -10:    precision = 0.213    recall = 0.918    F1 = 0.346
score >=    0:    precision = 0.330    recall = 0.918    F1 = 0.486
score >=   10:    precision = 0.675    recall = 0.767    F1 = 0.718
score >=   20:    precision = 0.840    recall = 0.288    F1 = 0.429
score >=   30:    precision = 1.000    recall = 0.014    F1 = 0.027
CPU times: user 667 ms, sys: 4.26 ms, total: 671 ms
Wall time: 674 ms




### 2. Use loginform features, but autodetect scores

The following ML-based model is trained using original loginform features (conditions used to increase or decrease the score). Roughly speaking, it uses the same information as loginform library, but instead of hardcoding `score += 10` and `score -= 10` the numbers are adjusted based on training data.

Note that login form detection quality is significantly better than that of the threshold-based model, and only slightly worse than that of the "full" form type detection model. This means the original loginform features are quite good at detecting login forms. For other form types these features are not enough: the scores for the other classes are poor.
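Both variants can be seen as the same linear scoring function with different weights, as the sketch below tries to show. The `linear_score` helper and the `HAND_TUNED` increments are hypothetical (loginform's real rules are not reproduced here). The `LEARNED_B` weights are copied from the "b" block printed by the cell below, and the feature names are the ones shown in that output.

```python
def linear_score(active_features, weights):
    """Sum the weights of the features that fire for a form."""
    return sum(weights.get(name, 0.0) for name in active_features)


# Hypothetical hand-tuned increments in the style of
# `score += 10` / `score -= 10`.
HAND_TUNED = {
    "typecount_password_eq1": 10,
    "2_or_3_inputs": 10,
    "typecount_text_gt1": -10,
}

# Learned weights for class "b", copied from the cell output below.
LEARNED_B = {
    "typecount_password_0": 1.3924,
    "typecount_radio_gt0": 0.5631,
    "typecount_text_gt1": -0.6217,
    "2_or_3_inputs": -0.6524,
    "typecount_checkbox_gt1": -0.7379,
    "typecount_text_0": -0.8396,
    "typecount_password_eq1": -1.8808,
    "bias": -3.7685,
}

# A login-like form scored with the hand-tuned increments.
login_like = ["typecount_password_eq1", "2_or_3_inputs"]
hand_tuned_score = linear_score(login_like, HAND_TUNED)

# A form with no password field and a radio button, scored for class "b".
cart_like = ["bias", "typecount_password_0", "typecount_radio_gt0"]
learned_score = linear_score(cart_like, LEARNED_B)

print("hand-tuned score:", hand_tuned_score)
print("learned score for class b: %.4f" % learned_score)
```

The only difference between the two approaches is where the numbers in the weight table come from: a person's guess or a fit to the training data.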

In [8]:
%%time

LOGINFORM_FEATURES = [
    ('bias', features.Bias(), DictVectorizer()),
    ('loginform', features.OldLoginformFeatures(), DictVectorizer())
]
# loginform_clf = LinearSVC(C=0.5, fit_intercept=False)
loginform_clf = LogisticRegression(penalty='l2', C=5, fit_intercept=False, random_state=0)

model = make_pipeline(
    _create_feature_union(LOGINFORM_FEATURES), 
    loginform_clf,
)

evaluation.print_metrics(model, X, y, X_train, X_test, y_train, y_test, ipython=True)
evaluation.print_informative_features(LOGINFORM_FEATURES, loginform_clf, 25)


Classification report (700 training examples, 375 testing examples):

             precision    recall  f1-score   support

          b       0.00      0.00      0.00         7
          c       0.43      0.81      0.57        32
          l       0.92      0.89      0.90        73
          m       0.00      0.00      0.00        32
          o       0.43      0.62      0.51        47
          p       0.00      0.00      0.00        31
          r       0.94      0.70      0.80        47
          s       0.57      0.76      0.65       106

avg / total       0.55      0.62      0.57       375

Active features: 64 out of possible 64

Confusion matrix (rows=>true values, columns=>predicted values):




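| true label | b | c | l | m | o | p | r | s |
|---|---|---|---|---|---|---|---|---|
| b | 0 | 1 | 0 | 0 | 4 | 0 | 0 | 2 |
| c | 0 | 26 | 0 | 0 | 2 | 0 | 1 | 3 |
| l | 0 | 1 | 65 | 0 | 6 | 0 | 0 | 1 |
| m | 0 | 4 | 0 | 0 | 3 | 0 | 0 | 25 |
| o | 0 | 12 | 0 | 0 | 29 | 0 | 0 | 6 |
| p | 0 | 2 | 0 | 0 | 5 | 0 | 1 | 23 |
| r | 0 | 4 | 6 | 0 | 3 | 0 | 33 | 1 |
| s | 0 | 10 | 0 | 0 | 15 | 0 | 0 | 81 |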





Running cross validation...
10-fold cross-validation F1: 0.573 (±0.100)  min=0.502  max=0.645
b
+1.3924:                      loginform  typecount_password_0
+0.5631:                      loginform  typecount_radio_gt0
-0.6217:                      loginform  typecount_text_gt1
-0.6524:                      loginform  2_or_3_inputs
-0.7379:                      loginform  typecount_checkbox_gt1
-0.8396:                      loginform  typecount_text_0
-1.8808:                      loginform  typecount_password_eq1
-3.7685:                           bias  bias
--------------------------------------------------------------------------------
c
+3.2011:                      loginform  typecount_password_0
+2.8215:                      loginform  typecount_text_gt1
+0.0203:                      loginform  typecount_text_0
-0.3697:                      loginform  typecount_radio_gt0
-1.2807:                      loginform  typecount_checkbox_gt1
-1.5019:                      loginform  type