# Form Classifier Using Machine Learning 

## Introducing the dataset

The model is trained on more than 1,000 annotated web forms; see the [dataset](https://github.com/RaoUmer/Formasaurus/tree/master/formasaurus/data). Most of the annotated pages were selected at random from the [Alexa](http://www.alexa.com/topsites) Top 1M websites.

## Introducing Formasaurus

**Formasaurus** is a Python package that tells you the type of an HTML form and its fields using machine learning.

It can detect whether a form is a login, search, registration, password recovery, join mailing list, contact, or order form (or something else), and it can tell which field is a password field, which is a search query, and so on.

Formasaurus uses two separate ML models: one for **form type detection** and one for **field type detection**. The field type detector uses the form type detection results to improve its quality.


### Form Type Detection

To detect HTML form types, Formasaurus takes a **form** element and uses a **linear classifier (Logistic Regression/SVM)** to choose its type from a predefined set of types.

Features include:
* counts of form elements of different types,
* whether a form is POST or GET,
* text on submit buttons,
* names and char ngrams of CSS classes and IDs,
* input labels,
* presence of certain substrings in URLs,
* etc.
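
To make the first few features in this list concrete, here is a minimal sketch that pulls element counts, the form method, and submit button text out of a page. It is written for Python 3 using only the standard library. `FormFeatureParser` and `form_features` are hypothetical names invented for this illustration, not part of Formasaurus, whose own extractors live in `formtype_features`.

```python
from collections import Counter
from html.parser import HTMLParser


class FormFeatureParser(HTMLParser):
    """Collect a few form-level features from the first <form> in a page."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.done = False
        self.method = None
        self.element_counts = Counter()
        self.submit_texts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self.in_form and not self.done:
            self.in_form = True
            # HTML forms default to GET when no method is given
            self.method = (attrs.get("method") or "get").lower()
        elif self.in_form and tag in ("input", "select", "textarea", "button"):
            # For <input>, count by its type attribute; otherwise by tag name
            kind = (attrs.get("type") or "text").lower() if tag == "input" else tag
            self.element_counts[kind] += 1
            if tag == "input" and kind == "submit":
                self.submit_texts.append(attrs.get("value") or "")

    def handle_endtag(self, tag):
        if tag == "form" and self.in_form:
            self.in_form = False
            self.done = True


def form_features(html):
    """Return a flat feature dict describing the first form in `html`."""
    parser = FormFeatureParser()
    parser.feed(html)
    features = {"is_post": parser.method == "post"}
    for kind, count in parser.element_counts.items():
        features["count_" + kind] = count
    features["submit_text"] = " ".join(parser.submit_texts).lower()
    return features


example = """
<form action="/search" method="get">
  <input type="text" name="q">
  <input type="submit" value="Search">
</form>
"""
print(form_features(example))
```

A flat dict like this is the kind of input a `DictVectorizer` can turn into a numeric feature vector; the text features would normally go through a separate text vectorizer.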

### Field Type Detection

To detect form field types, Formasaurus uses a **Conditional Random Field (CRF)** model. The fields of an HTML form make up a sequence in which order matters; a CRF takes that order into account.

Features include:

* form type predicted by a form type detector,
* field tag name,
* field value,
* text before and after field,
* field CSS class and ID,
* text of field label element,
* field title and placeholder attributes,
* etc.
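
The sequence view of a form can be sketched as one feature dict per field, kept in document order. This is an illustration under assumptions, not the Formasaurus feature extractor: the function name and the keys of the input dicts (`tag`, `name`, `css`, `label`, `placeholder`) are hypothetical. A list of dicts per sequence is the input shape that CRF toolkits such as sklearn-crfsuite accept.

```python
def field_sequence_features(form_type, fields):
    """Turn an ordered list of field descriptions into one feature dict per field.

    `fields` is a list of plain dicts with hypothetical keys:
    'tag', 'name', 'css', 'label', 'placeholder'.
    """
    sequence = []
    last = len(fields) - 1
    for i, field in enumerate(fields):
        feats = {
            # The predicted form type is shared by every field in the form
            "form_type": form_type,
            "tag": field.get("tag", "input"),
            "name": field.get("name", "").lower(),
            "css": field.get("css", "").lower(),
            "label": field.get("label", "").lower(),
            "placeholder": field.get("placeholder", "").lower(),
            "is_first": i == 0,
            "is_last": i == last,
        }
        # Optional neighbor context; the CRF additionally learns
        # transitions between the labels of adjacent fields.
        if i > 0:
            feats["prev_name"] = fields[i - 1].get("name", "").lower()
        if i < last:
            feats["next_name"] = fields[i + 1].get("name", "").lower()
        sequence.append(feats)
    return sequence


registration_fields = [
    {"name": "user[login]", "label": "Username"},
    {"name": "user[email]", "label": "Email"},
    {"name": "user[password]", "label": "Password"},
]
for feats in field_sequence_features("registration", registration_fields):
    print(feats["name"], feats.get("prev_name"), feats.get("next_name"))
```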

In [1]:
# importing required modules
from formasaurus import formtype_features as features
from formasaurus import formtype_model
from formasaurus.storage import Storage, load_html

import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline, make_union
#from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, precision_recall_fscore_support

## Dataset

In [2]:
storage = Storage("C:/Users/raoumer/Desktop/form_classifier_ml/data")
storage.check()
storage.print_form_type_counts(simplify=True)

Checking: 100%|###########################| 946/946 [00:18<00:00, 55.76 files/s]


Status: OK
Annotated HTML forms (simplified classes):





415   search                    (s)
246   login                     (l)
165   registration              (r)
143   other                     (o)
138   contact/comment           (c)
132   join mailing list         (m)
105   password/login recovery   (p)
74    order/add to cart         (b)

Total form count: 1418


## Load training / evaluation data

In [3]:
# Load training / evaluation data
annotations = list(storage.iter_annotations(
    simplify_form_types=True,
    simplify_field_types=True,
    verbose=True,
    leave=True,        
))
X, y = formtype_model.get_Xy(annotations, full_type_names=True)
#print "X:",X
print "y:", y

Loading: 946 files [00:07, 101.88 files/s]



y: [u'search' u'login' u'search' ..., u'search' u'registration' u'login']


## Useful features for detecting search forms

### Search forms

* a single query field
* a field named "q" or "s"
* "search" in URL
* "search" in submit button text (submit value)
* "search" in form CSS class or ID
* no password field
* method == GET?
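
As a rough illustration, the hints above can be written as hand-coded checks that are simply counted. The trained classifier learns a weight for each feature instead of counting them equally, so this sketch only makes the hints concrete. The dict layout (`method`, `action`, `css`, `submit_text`, `fields`) and the sample values are made up for the example.

```python
def search_form_score(form):
    """Count how many of the search-form hints a form description matches.

    `form` is a plain dict with hypothetical keys: 'method', 'action',
    'css', 'submit_text', and 'fields' (a list of (type, name) pairs).
    """
    fields = form.get("fields", [])
    text_fields = [name for kind, name in fields if kind in ("text", "search")]
    hints = [
        len(text_fields) == 1,                               # a single query field
        any(name in ("q", "s") for name in text_fields),     # field named "q" or "s"
        "search" in form.get("action", "").lower(),          # "search" in URL
        "search" in form.get("submit_text", "").lower(),     # "search" on the button
        "search" in form.get("css", "").lower(),             # "search" in class or id
        not any(kind == "password" for kind, _ in fields),   # no password field
        form.get("method", "get").lower() == "get",          # GET method
    ]
    return sum(hints)


search_like = {
    "method": "get",
    "action": "/search",
    "css": "site-search",
    "submit_text": "",
    "fields": [("text", "q")],
}
login_like = {
    "method": "post",
    "action": "/session",
    "css": "auth-form",
    "submit_text": "Sign in",
    "fields": [("text", "login"), ("password", "password")],
}
print(search_form_score(search_like), search_form_score(login_like))
```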

In [4]:
%%time
# reload(features)
from formasaurus.formtype_model import _create_feature_union

# a list of 3-tuples with default features:
# (feature_name, form_transformer, vectorizer)
FEATURES = [
    (
        "form elements",
        features.FormElements(),
        DictVectorizer()
    ),
    (
        "<input type=submit value=...>",
        features.SubmitText(),
        CountVectorizer(ngram_range=(1,2), min_df=1, binary=True)
    ),
    (
        "<a> TEXT </a>",
        features.FormLinksText(),
        TfidfVectorizer(ngram_range=(1,2), min_df=4, binary=True,
                        stop_words={'and', 'or', 'of'})
    ),
    (
        "<label> TEXT </label>",
        features.FormLabelText(),
        TfidfVectorizer(ngram_range=(1,2), min_df=3, binary=True,
                        stop_words="english")
    ),

    (
        "<form action=...>",
        features.FormUrl(),
        TfidfVectorizer(ngram_range=(5,6), min_df=4, binary=True,
                        analyzer="char_wb")
    ),
    (
        "<form class=... id=...>",
        features.FormCss(),
        TfidfVectorizer(ngram_range=(4,5), min_df=3, binary=True,
                        analyzer="char_wb")
    ),
    (
        "<input class=... id=...>",
        features.FormInputCss(),
        TfidfVectorizer(ngram_range=(4,5), min_df=5, binary=True,
                        analyzer="char_wb")
    ),
    (
        "<input name=...>",
        features.FormInputNames(),
        TfidfVectorizer(ngram_range=(5,6), min_df=3, binary=True,
                        analyzer="char_wb")
    ),
    (
        "<input title=...>",
        features.FormInputTitle(),
        TfidfVectorizer(ngram_range=(5,6), min_df=3, binary=True,
                        analyzer="char_wb")
    ),
]

Wall time: 0 ns
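
Several of the vectorizers above use `analyzer="char_wb"` with an `ngram_range` such as `(4,5)` or `(5,6)`. The following pure-Python function is a sketch of what that analyzer produces, as far as I understand scikit-learn's behavior: each whitespace-separated word is lowercased, padded with one space on each side, and cut into character n-grams that never cross a word boundary. It is an approximation for illustration, not scikit-learn's code.

```python
def char_wb_ngrams(text, min_n, max_n):
    """Character n-grams taken inside word boundaries, each word padded with spaces."""
    ngrams = []
    for word in text.lower().split():
        padded = " " + word + " "
        for n in range(min_n, max_n + 1):
            if len(padded) <= n:
                # The padded word is not longer than n: keep it once and
                # skip the larger n values, which would repeat it.
                ngrams.append(padded)
                break
            for start in range(len(padded) - n + 1):
                ngrams.append(padded[start:start + n])
    return ngrams


print(char_wb_ngrams("search", 5, 5))
print(char_wb_ngrams("q", 5, 6))
```

Character n-grams let the model pick up substrings such as "earch" inside names like `searchbox` or `site-search`, which whole-word tokens would miss.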


## Classification Model

In [9]:
clf = LinearSVC(C=0.5, random_state=150, fit_intercept=False)
model = Pipeline([
    ('fe', _create_feature_union(FEATURES)),
    ('clf', clf),
])
model.fit(X, y)
print model.predict(X)
print model.decision_function(X)
formtype_model.print_classification_report(annotations, n_folds=10, model=model)

[u'search' u'login' u'search' ..., u'search' u'registration' u'login']
[[-1.25773448 -1.70821209 -1.39611482 ..., -1.64538027 -1.37630567
   2.87372432]
 [-1.93420775 -1.74188827  1.58841722 ..., -1.40369579 -1.62162767
  -1.44747327]
 [-1.20032282 -1.60389314 -1.25234017 ..., -1.607422   -1.43255901
   2.42709097]
 ..., 
 [-1.23189527 -1.49399696 -1.15474402 ..., -1.54887777 -1.57957592
   2.7126756 ]
 [-1.31487788 -0.91956482 -0.92719609 ..., -2.1374183   0.91700013
  -2.5699984 ]
 [-1.47015145 -1.7952439   0.9613531  ..., -1.48798736 -1.17446817
  -2.56394994]]
                         precision    recall  f1-score   support

                 search       0.92      0.96      0.94       415
                  login       0.96      0.96      0.96       246
           registration       0.95      0.87      0.91       165
password/login recovery       0.86      0.84      0.85       105
        contact/comment       0.85      0.94      0.89       138
      join mailing list       0.88    

In [8]:
from sklearn.metrics import roc_curve, auc
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier

In [11]:
import matplotlib.pyplot as plt

def ROC_multi_class(Xtr, ytr, clf):
    # Class labels are strings such as 'search' and 'login',
    # so take them from the data instead of assuming integer codes
    classes = np.unique(ytr)
    # Binarize the output: one indicator column per form type
    ytr = label_binarize(ytr, classes=classes)
    n_classes = ytr.shape[1]

    # shuffle and split training and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        Xtr, ytr, test_size=.30, random_state=40)

    # Learn to predict each class against the others
    classifier = OneVsRestClassifier(clf)
    classifier.fit(X_train, y_train)
    y_pred_score = classifier.decision_function(X_test)

    # Compute ROC curve and ROC area for each class
    # (y_test is already binarized, so it is used directly)
    fpr = dict()
    tpr = dict()
    roc_auc = dict()
    for i in range(n_classes):
        fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_pred_score[:, i])
        roc_auc[i] = auc(fpr[i], tpr[i])

    # Plot one ROC curve per class, labelled with the class name
    for i in range(n_classes):
        plt.plot(fpr[i], tpr[i], label='ROC curve of class {0} (area = {1:0.2f})'
                                       ''.format(classes[i], roc_auc[i]))

    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC of multi-class')
    plt.legend(loc="lower right")
    plt.show()

In [12]:
ROC_multi_class(X, y, model)
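
For readers who want to see what a one-vs-rest ROC curve measures, the computation for a single class can be sketched without scikit-learn. The labels and decision scores below are made-up toy values, and the helper names are hypothetical; the sketch is written for Python 3 and assumes both positive and negative samples are present.

```python
def binarize(labels, positive):
    """One-vs-rest targets: 1 for the chosen class, 0 for every other class."""
    return [1 if label == positive else 0 for label in labels]


def roc_points(y_true, scores):
    """Return (fpr, tpr) lists obtained by sweeping a threshold over the scores."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for idx, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a group of tied scores
        if idx + 1 == len(ranked) or ranked[idx + 1][0] != score:
            fpr.append(fp / neg)
            tpr.append(tp / pos)
    return fpr, tpr


def auc_trapezoid(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for i in range(1, len(fpr)):
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
    return area


labels = ["search", "login", "search", "other"]
search_scores = [2.0, -1.0, 0.5, 0.7]  # toy decision scores for class "search"

y_search = binarize(labels, "search")
fpr, tpr = roc_points(y_search, search_scores)
print(auc_trapezoid(fpr, tpr))
```

Note that the binarization compares against the actual label strings; integer class codes would match none of them.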

## Usage

In [7]:
# Get some HTML page
import requests
html = requests.get('https://www.github.com/').text
#print html

In [2]:
import formasaurus
print formasaurus.extract_forms(html)

[(<Element form at 0x89bbd18>, {'fields': {'q': 'search query'}, 'form': u'search'}), (<Element form at 0x89bbd68>, {'fields': {'user[password]': 'password', 'user[login]': 'username', 'user[email]': 'email'}, 'form': u'registration'})]


By default, the info dict contains only the most likely form and field types. To get probabilities, pass **proba=True**:

In [3]:
print formasaurus.extract_forms(html, proba=True, threshold=0.05)

[(<Element form at 0x8d26408>, {'fields': {'q': {'search query': 0.999437231678832}}, 'form': {u'search': 0.99570326032828704}}), (<Element form at 0x8d26458>, {'fields': {'user[password]': {'password': 0.9988987254599053}, 'user[login]': {'username': 0.9853568952022753}, 'user[email]': {'email': 0.9998668551783014}}, 'form': {u'login': 0.13553890385438713, u'registration': 0.85639807222240072}})]
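
With **proba=True** each type becomes a dict of candidate labels and probabilities, so picking a single answer is left to the caller. A minimal sketch of that step follows; it uses a copy of the second info dict from the output above rather than calling Formasaurus, and `most_likely` is a hypothetical helper name.

```python
def most_likely(probabilities):
    """Return the (label, probability) pair with the highest probability."""
    return max(probabilities.items(), key=lambda item: item[1])


# Copied from the second form in the output above
info = {
    "fields": {
        "user[password]": {"password": 0.9988987254599053},
        "user[login]": {"username": 0.9853568952022753},
        "user[email]": {"email": 0.9998668551783014},
    },
    "form": {"login": 0.13553890385438713, "registration": 0.85639807222240072},
}

form_type, form_proba = most_likely(info["form"])
field_types = {name: most_likely(probs)[0] for name, probs in info["fields"].items()}
print(form_type, round(form_proba, 3))
print(field_types)
```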


If field types are not needed, processing can be sped up with the **fields=False** option. In this case the 'fields' results are not computed:

In [4]:
print formasaurus.extract_forms(html, fields=False)

[(<Element form at 0x8d264a8>, {'form': u'search'}), (<Element form at 0x8d26c78>, {'form': u'registration'})]
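
Because the result is a list of (form element, info dict) pairs, selecting forms of one type is a simple filter. The sketch below assumes the default output shape shown above (no **proba=True**) and uses placeholder strings instead of real lxml elements; `forms_of_type` is a hypothetical helper, not a Formasaurus function.

```python
def forms_of_type(results, form_type):
    """Keep the (form, info) pairs whose predicted form type matches."""
    return [(form, info) for form, info in results if info["form"] == form_type]


# Stand-in for the output of extract_forms(html, fields=False)
results = [
    ("<form 1>", {"form": "search"}),
    ("<form 2>", {"form": "registration"}),
]
print(forms_of_type(results, "search"))
```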


## Testing form pages from PIEAS website

In [3]:
import requests
import formasaurus

In [4]:
html = requests.get('https://sites.google.com/site/drmabidm/').text
print formasaurus.extract_forms(html)

[(<Element form at 0xb0fa318>, {'fields': {'q': 'search query'}, 'form': u'search'})]


In [5]:
html = requests.get('https://sites.google.com/site/pnetlab786/').text
print formasaurus.extract_forms(html)

[(<Element form at 0xb0fab38>, {'fields': {'q': 'search query'}, 'form': u'search'})]


In [6]:
html = requests.get('http://www.pieas.edu.pk/').text
print formasaurus.extract_forms(html)

[]


In [7]:
html = requests.get('http://faculty.pieas.edu.pk/fayyaz/').text
print formasaurus.extract_forms(html)

[]


## Reference

* https://github.com/TeamHG-Memex/Formasaurus.git
* http://formasaurus.readthedocs.org/en/latest/