In [1]:
import sys
sys.path.insert(0, '..')

from formasaurus import formtype_features as features
from formasaurus import evaluation
from formasaurus.storage import Storage, load_html

## Available training data

In [2]:
storage = Storage("../formasaurus/data")
storage.check()
storage.print_form_type_counts()

Checking: 100%|##########################| 343/343 [00:01<00:00, 215.33 files/s]
                                          

Status: OK
Annotated HTML forms:

177   search                    (s)
133   login                     (l)
113   other                     (o)
88    registration              (r)
50    password/login recovery   (p)
37    join mailing list         (m)
36    contact                   (c)

Total form count: 634




## Load training / evaluation data

We must be careful when splitting the dataset into training and evaluation parts: forms from the same domain should end up in the same "bin". There can be several pages from the same domain, and these pages may contain duplicate or similar forms (e.g. a search form on each page). If one such form goes into the training dataset and another into the evaluation dataset, the metrics will be too optimistic, and they can lead us to choose the wrong features/models. For example, train_test_split from scikit-learn, which splits at random, shouldn't be used here.

As an approximation, the data is sorted by domain (this is done in the storage.get_Xy function) and then split into 2 parts. This means that at most 1 domain can have forms in both parts (the one the split point falls in), so at most 1 domain can inflate the metrics.
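The idea can be sketched in plain Python. This is a minimal illustration, not what `storage.get_Xy` does internally: the helper names `domain_of` and `split_by_domain` and the example URLs are hypothetical, and the host name is used as a crude stand-in for "domain". Unlike the cell below, this variant also moves the cut forward so that no domain straddles the boundary.

```python
from urllib.parse import urlsplit


def domain_of(url):
    """Return the lowercased host of a URL (a crude stand-in for 'domain')."""
    return urlsplit(url).netloc.lower()


def split_by_domain(urls, train_size):
    """
    Sort item indices by domain, then cut near train_size.
    The cut is moved forward until the domain changes, so that
    no domain ends up on both sides of the split.
    """
    order = sorted(range(len(urls)), key=lambda i: domain_of(urls[i]))
    cut = min(train_size, len(order))
    while (0 < cut < len(order)
           and domain_of(urls[order[cut]]) == domain_of(urls[order[cut - 1]])):
        cut += 1
    return order[:cut], order[cut:]


# Hypothetical page URLs, one per annotated form.
urls = [
    "http://b.example/search",
    "http://a.example/login",
    "http://b.example/contact",
    "http://c.example/",
    "http://a.example/register",
]
train_idx, test_idx = split_by_domain(urls, train_size=3)
train_domains = {domain_of(urls[i]) for i in train_idx}
test_domains = {domain_of(urls[i]) for i in test_idx}
print(train_idx, test_idx)
```

With a requested train size of 3 the cut lands in the middle of `b.example`, so it is pushed to 4 and both `b.example` forms stay in the training part.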

In [3]:
TRAIN_SIZE = 500

X, y = storage.get_Xy(verbose=True, leave=True)
X_train, X_test, y_train, y_test = X[:TRAIN_SIZE], X[TRAIN_SIZE:], y[:TRAIN_SIZE], y[TRAIN_SIZE:]

print("Train size: %d, test size: %d" % (len(y_train), len(y_test)))

Loading: 343 files [00:01, 233.10 files/s]


Train size: 500, test size: 134





## Ideas for useful features

### Search forms

* a single query field
* a field named "q" or "s"
* "search" in URL
* "search" in submit button text (submit value)
* "search" in form css class or id
* no password field
* method == GET?

### Login forms

* username or email and password
* 2 passwords - likely not a login form
* "login" or "sign in" (or variations) in URL
* "login" or "sign in" (or variations) in form css class or id
* "login" or "sign in" in submit button text
* "Remember me" checkbox (or any single checkbox)
* no select elements
* no textarea elements
* openid?
* method == POST

### Registration forms

* 2 passwords 
* "register" / "sign up" in URL, form css class / id or submit button text
* "agree" checkbox
* email
* username
* method == POST

### Contact forms

* feedback in URL/class
* textarea
* "Send" button
* email
* method == POST

### Password reset

* a single email or username field
* "password" in URL/css class/ submit button text

### Join Mailing List

* a single email field
* subscribe/join/newsletter words
* a short form

The main problem with "join mailing list" forms is distinguishing them from search forms.

## How to handle them

Instead of hardcoding the features above, many of them are generalized. For example, instead of writing "search in URL" we extract all 5-character substrings from the URL and use "`urlsubstring<N>` in URL" as features. This approach has some disadvantages, but it provides a good starting point.

The feature extractors are stored in the formasaurus.formtype_features module (imported above as `features`).
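To make the substring idea concrete, here is a minimal stdlib-only sketch. It is not the notebook's implementation: the cells below rely on scikit-learn vectorizers with `analyzer="char_wb"`, which also pad word boundaries. The function names and the `"url has ..."` key format are hypothetical.

```python
def char_ngrams(text, n=5):
    """Return all contiguous n-character substrings of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def url_ngram_features(url, n=5):
    """Binary features: one key per distinct n-gram found in the URL."""
    return {"url has %r" % gram: True for gram in char_ngrams(url.lower(), n)}


grams = char_ngrams("/search", 5)
print(grams)  # ['/sear', 'searc', 'earch']

feats = url_ngram_features("/Search")
print(sorted(feats))
```

A classifier never sees the word "search" as such; it sees overlapping pieces like `searc` and `earch`, and learns which pieces matter for which form type.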

In [4]:
from scipy import stats as st
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline, make_union
from sklearn.linear_model import LogisticRegression, SGDClassifier
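# sklearn.grid_search and sklearn.cross_validation were removed in
# scikit-learn 0.20; both names live in sklearn.model_selection (0.18+).
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from sklearn.svm import LinearSVC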
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_fscore_support

In [5]:
%%time
# reload(features)
from formasaurus.formtype_model import _create_feature_union
# ======= define the model ========

# features should be kept in sync with formasaurus.formtype_model
# a list of 3-tuples with default features:
# (feature_name, form_transformer, vectorizer)
FEATURES = [
    (
        "bias",
        features.Bias(),
        DictVectorizer(),
    ),
    (
        "form elements",
        features.FormElements(),
        DictVectorizer()
    ),
    (
        "<input type=submit value=...>",
        features.SubmitText(),
        CountVectorizer(ngram_range=(1,2), min_df=1, binary=True)
    ),
    (
        "<a> TEXT </a>",
        features.FormLinksText(),
        TfidfVectorizer(ngram_range=(1,2), min_df=4, binary=True,
                        stop_words={'and', 'or', 'of'})
    ),
    (
        "<label> TEXT </label>",
        features.FormLabelText(),
        TfidfVectorizer(ngram_range=(1,2), min_df=3, binary=True,
                        stop_words="english")
    ),

    (
        "<form action=...>",
        features.FormUrl(),
        TfidfVectorizer(ngram_range=(5,6), min_df=4, binary=True,
                        analyzer="char_wb")
    ),
    (
        "<form class=... id=...>",
        features.FormCss(),
        TfidfVectorizer(ngram_range=(4,5), min_df=3, binary=True,
                        analyzer="char_wb")
    ),
    (
        "<input class=... id=...>",
        features.FormInputCss(),
        TfidfVectorizer(ngram_range=(4,5), min_df=5, binary=True,
                        analyzer="char_wb")
    ),
    (
        "<input name=...>",
        features.FormInputNames(),
        TfidfVectorizer(ngram_range=(5,6), min_df=3, binary=True,
                        analyzer="char_wb")
    ),
    (
        "<input title=...>",
        features.FormInputTitle(),
        TfidfVectorizer(ngram_range=(5,6), min_df=3, binary=True,
                        analyzer="char_wb")
    ),
]


# clf = SGDClassifier(
#     penalty='elasticnet', 
#     loss='log', 
#     alpha=0.0002,
#     fit_intercept=False, 
#     shuffle=True, 
#     random_state=0,
#     n_iter=50,
# )
clf = LogisticRegression(penalty='l2', C=5, fit_intercept=False, random_state=0, tol=0.01)

# clf = LinearSVC(C=0.5, random_state=0, fit_intercept=False)
model = Pipeline([
    ('fe', _create_feature_union(FEATURES)),
    ('clf', clf),
])

evaluation.print_metrics(model, X, y, X_train, X_test, y_train, y_test, ipython=True)


Classification report (500 training examples, 134 testing examples):

             precision    recall  f1-score   support

          c       1.00      0.88      0.93         8
          l       0.96      0.96      0.96        28
          m       0.80      0.40      0.53        10
          o       0.69      1.00      0.82        18
          p       1.00      1.00      1.00        12
          r       1.00      0.76      0.87        17
          s       0.95      1.00      0.98        41

avg / total       0.92      0.91      0.91       134

Active features: 31521 out of possible 31521

Confusion matrix (rows=>true values, columns=>predicted values):


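|   | c | l | m | o | p | r | s |
|---|---|---|---|---|---|---|---|
| c | 7 | 0 | 0 | 1 | 0 | 0 | 0 |
| l | 0 | 27 | 0 | 1 | 0 | 0 | 0 |
| m | 0 | 0 | 4 | 4 | 0 | 0 | 2 |
| o | 0 | 0 | 0 | 18 | 0 | 0 | 0 |
| p | 0 | 0 | 0 | 0 | 12 | 0 | 0 |
| r | 0 | 1 | 1 | 2 | 0 | 13 | 0 |
| s | 0 | 0 | 0 | 0 | 0 | 0 | 41 |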



Running cross validation...
10-fold cross-validation F1: 0.899 (±0.093)  min=0.796  max=0.951
CPU times: user 7.05 s, sys: 70.7 ms, total: 7.12 s
Wall time: 7.24 s


## Check what classifier learned

For linear classifiers like Logistic Regression or an SVM without a kernel we can inspect the coefficient values to better understand how decisions are made.

When features are correlated, the weight is spread across all of them, so inspecting coefficients alone is not enough; looking at them is still useful, though.
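As a rough illustration of what "checking coefficients" means, the sketch below ranks features by weight for a single class. It is not the code behind `evaluation.print_informative_features`; the helper name, the feature names and the weights are all made up. For a fitted multiclass scikit-learn linear model, one row of `coef_` would play the role of `coefs` here.

```python
def top_features(feature_names, coefs, k=3):
    """
    Pair each feature name with its weight and return
    (k most positive, k most negative) as lists of (weight, name).
    """
    ranked = sorted(zip(coefs, feature_names), reverse=True)
    return ranked[:k], ranked[-k:]


# Made-up weights for one class, for illustration only.
names = ["has textarea", "has password", "submit: send", "submit: search", "bias"]
coefs = [3.4, -1.4, 0.7, -0.9, -0.2]

positive, negative = top_features(names, coefs, k=2)
for weight, name in positive + negative:
    print("%+.4f:  %s" % (weight, name))
```

Large positive weights push a form towards the class, large negative weights push it away; weights near zero contribute little either way.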

In [6]:
evaluation.print_informative_features(FEATURES, clf, 25)

c
+3.3912:                  form elements  has <textarea>
+1.9744:                  <a> TEXT </a>  contact
+1.9442:                  <a> TEXT </a>  contact us
+1.8133:                  <a> TEXT </a>  us
+1.5362:                  <a> TEXT </a>  privacy
+1.1287:        <form class=... id=...>  back
+0.9973:  <input type=submit value=...>  отправить
+0.6817:  <input type=submit value=...>  send
+0.6654:  <input type=submit value=...>  submit
+0.5960:  <input type=submit value=...>  küldök
+0.5960:  <input type=submit value=...>  neki
+0.5960:  <input type=submit value=...>  értesítőt
+0.5960:  <input type=submit value=...>  küldök neki
+0.5960:  <input type=submit value=...>  értesítőt küldök
+0.5702:          <label> TEXT </label>  message
+0.5678:          <label> TEXT </label>  address
+0.5454:          <label> TEXT </label>  имя
+0.5356:          <label> TEXT </label>  phone
+0.5278:          <label> TEXT </label>  city
+0.5172:        <form class=... id=...>  _form
+0.5172:        <f

## Compare results with "loginform" library

It is not possible to compare the results with the "loginform" library directly, because loginform

* always tries to return a login form even if the score is low;
* only detects login forms;
* in case of several forms returns a single form with the best score instead of deciding for each form whether to return it or not.

So we used two approaches:

1. Use `loginform._form_score` with different thresholds; assume that `loginform` detected a login form if the score is greater than or equal to the threshold.
2. Train the same model, but using features from the loginform library (weights are learned instead of being hardcoded as 'score' increments/decrements).



### 1. loginform scores + thresholds

* **score >= -100** means "simply treat all forms as login forms".

* **score >= 0**: all (or most) login forms are captured, but there are many false positives 
  (precision is about 0.44), so this is not much better than treating all forms as login forms.

* **score >= 10**: the F1 score is the best among all thresholds (about 0.76), 
  but the quality is significantly worse than the F1 of ML-based models.
  
* **score >= 20**: ~90% of detected login forms are correct, but most 
  login forms are not detected (recall is 0.25). Also, the ~90% precision is still lower than what ML-based models give us.
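The first bullet can be checked by hand. The sketch below computes precision, recall and F1 for the positive class in plain Python (the helper name is hypothetical; the cell below uses scikit-learn's `precision_recall_fscore_support` instead). The test set contains 134 forms, 28 of which are login forms, so predicting "login" for everything gives:

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive (True) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)        # true positives
    fp = sum(1 for t, p in pairs if not t and p)    # false positives
    fn = sum(1 for t, p in pairs if t and not p)    # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# 28 login forms out of 134 test forms; every form is predicted as "login".
y_true = [True] * 28 + [False] * 106
y_pred = [True] * 134

p, r, f1 = precision_recall_f1(y_true, y_pred)
print("precision = %0.3f    recall = %0.3f    F1 = %0.3f" % (p, r, f1))
# precision = 0.209    recall = 1.000    F1 = 0.346
```

This is the baseline every other threshold should be compared against.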


In [8]:
%%time 
import loginform


def labels_to_binary(y):
    """ Convert labels to 2-classes: login forms and non-login forms """
    return [tp == 'l' for tp in y]

    
def predict_loginform(X, threshold):
    """
    Return if forms are login or not using loginform
    library scores and a threshold.
    """
    return [
        (loginform._form_score(form) >= threshold)
        for form in X
    ]


def print_threshold_metrics(X_test, y_test, threshold):
    y_test = labels_to_binary(y_test)
    y_pred = predict_loginform(X_test, threshold)

    precision, recall, f1, support = precision_recall_fscore_support(y_test, y_pred, pos_label=True)
    print(
        "score >= %4d:    precision = %0.3f    recall = %0.3f    F1 = %0.3f" % (
        threshold, precision[1], recall[1], f1[1]
    ))


for threshold in [-100, -10, 0, 10, 20, 30]:
    print_threshold_metrics(X_test, y_test, threshold)

score >= -100:    precision = 0.209    recall = 1.000    F1 = 0.346
score >=  -10:    precision = 0.241    recall = 0.964    F1 = 0.386
score >=    0:    precision = 0.435    recall = 0.964    F1 = 0.600
score >=   10:    precision = 0.733    recall = 0.786    F1 = 0.759
score >=   20:    precision = 0.875    recall = 0.250    F1 = 0.389
score >=   30:    precision = 0.000    recall = 0.000    F1 = 0.000
CPU times: user 196 ms, sys: 1.24 ms, total: 197 ms
Wall time: 200 ms



### 2. Use loginform features, but autodetect scores

The following ML-based model is trained using the original loginform features (the conditions used to increase or decrease the score). Roughly speaking, it uses the same information as the loginform library, but instead of hardcoding `score += 10` and `score -= 10` the numbers are adjusted based on training data.

Note that the login form detection quality (F1 of 0.95) is significantly better than that of the threshold-based model (best F1 of about 0.76); it is only slightly worse than that of the "full" form type detection model (0.96). This means the original loginform features are quite good at detecting login forms. But for other form types these features are not enough: the scores for the other classes are poor, and mailing list and password recovery forms are never predicted at all.
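The difference between the two approaches is only in where the numbers come from. The sketch below scores the same form with hand-set and with learned weights. The feature names and all the numbers are hypothetical; they are not loginform's actual rules or the weights of the model trained below.

```python
def linear_score(active_features, weights):
    """Sum the weights of the features that are present on a form."""
    return sum(weights.get(name, 0.0) for name in active_features)


# Hand-tuned increments in the style of `score += 10` / `score -= 10`.
# Hypothetical feature names and values.
HAND_SET = {
    "one_password_field": 10,
    "two_or_three_inputs": 10,
    "many_text_fields": -10,
}

# The same features with weights such as a fitted linear model
# might assign them (made-up numbers).
LEARNED = {
    "one_password_field": 4.6,
    "two_or_three_inputs": 0.1,
    "many_text_fields": -1.7,
}

form = ["one_password_field", "two_or_three_inputs"]
hand_score = linear_score(form, HAND_SET)
learned_score = linear_score(form, LEARNED)
print(hand_score, learned_score)
```

With hand-set weights every rule counts equally; a trained model is free to decide that one rule is far more informative than another.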

In [9]:
%%time

LOGINFORM_FEATURES = [
    ('bias', features.Bias(), DictVectorizer()),
    ('loginform', features.OldLoginformFeatures(), DictVectorizer())
]
# loginform_clf = LinearSVC(C=0.5, fit_intercept=False)
loginform_clf = LogisticRegression(penalty='l2', C=5, fit_intercept=False, random_state=0)

model = make_pipeline(
    _create_feature_union(LOGINFORM_FEATURES), 
    loginform_clf,
)

evaluation.print_metrics(model, X, y, X_train, X_test, y_train, y_test, ipython=True)
evaluation.print_informative_features(LOGINFORM_FEATURES, loginform_clf, 25)


Classification report (500 training examples, 134 testing examples):

             precision    recall  f1-score   support

          c       0.88      0.88      0.88         8
          l       0.93      0.96      0.95        28
          m       0.00      0.00      0.00        10
          o       0.64      1.00      0.78        18
          p       0.00      0.00      0.00        12
          r       0.86      0.71      0.77        17
          s       0.67      0.90      0.77        41

avg / total       0.65      0.75      0.69       134

Active features: 56 out of possible 56

Confusion matrix (rows=>true values, columns=>predicted values):



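|   | c | l | m | o | p | r | s |
|---|---|---|---|---|---|---|---|
| c | 7 | 0 | 0 | 1 | 0 | 0 | 0 |
| l | 0 | 27 | 0 | 1 | 0 | 0 | 0 |
| m | 0 | 0 | 0 | 0 | 0 | 0 | 10 |
| o | 0 | 0 | 0 | 18 | 0 | 0 | 0 |
| p | 0 | 0 | 0 | 5 | 0 | 0 | 7 |
| r | 1 | 2 | 0 | 1 | 0 | 12 | 1 |
| s | 0 | 0 | 0 | 2 | 0 | 2 | 37 |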




Running cross validation...
10-fold cross-validation F1: 0.647 (±0.105)  min=0.573  max=0.728
c
+3.8501:                      loginform  typecount_text_gt1
+2.5960:                      loginform  typecount_password_0
+0.8810:                      loginform  typecount_text_0
-0.5189:                      loginform  typecount_radio_gt0
-1.2757:                      loginform  typecount_checkbox_gt1
-1.4178:                      loginform  typecount_password_eq1
-3.0174:                      loginform  2_or_3_inputs
-6.5263:                           bias  bias
--------------------------------------------------------------------------------
l
+4.6357:                      loginform  typecount_password_eq1
+0.6044:                      loginform  typecount_text_0
+0.0533:                      loginform  2_or_3_inputs
-1.6830:                      loginform  typecount_password_0
-1.7110:                      loginform  typecount_text_gt1
-2.0244:                           bias  bias
-2.09