DOC improved tutorial and install instructions
kmike committed Dec 17, 2015
1 parent e206042 commit b64e7f3
Showing 3 changed files with 106 additions and 8 deletions.
1 change: 0 additions & 1 deletion docs/conf.py
@@ -15,7 +15,6 @@

import sys
import os
import shlex

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
20 changes: 16 additions & 4 deletions docs/install.rst
@@ -1,18 +1,30 @@
Install
=======

Formasaurus requires Python 2.7+ or 3.3+,
scipy, numpy, scikit-learn, sklearn-crfsuite and lxml to work.
Formasaurus requires Python 2.7+ or 3.3+ and the following Python packages:

First, make sure numpy is installed. Then, to install Formasaurus with all
* scipy_
* numpy_
* scikit-learn_ 0.17+
* sklearn-crfsuite_
* lxml_

.. _numpy: https://github.com/numpy/numpy
.. _scipy: https://github.com/scipy/scipy
.. _scikit-learn: https://github.com/scikit-learn/scikit-learn
.. _sklearn-crfsuite: https://github.com/TeamHG-Memex/sklearn-crfsuite
.. _lxml: https://github.com/lxml/lxml

First, make sure numpy_ is installed. Then, to install Formasaurus with all
its other dependencies run

::

    pip install formasaurus[with-deps]

These packages may require extra steps to install, so the command above
may fail. In this case install dependencies manually, one by one, and
may fail. In this case install dependencies manually, one by one
(follow their install instructions),
then run::

    pip install formasaurus
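
Before falling back to a manual install, it can help to see which dependencies are already importable. The following is a minimal sketch, not part of Formasaurus: the helper ``missing_modules`` is a hypothetical name, and the list uses the import names of the packages above (scikit-learn is imported as ``sklearn``, sklearn-crfsuite as ``sklearn_crfsuite``). It does not check the scikit-learn 0.17+ version requirement.

```python
import importlib.util

# Import names for the packages listed in the install instructions.
# Assumption: these are the top-level module names each package provides.
REQUIRED_MODULES = ["numpy", "scipy", "sklearn", "sklearn_crfsuite", "lxml"]


def missing_modules(module_names):
    """Return the names from ``module_names`` that cannot be located.

    ``find_spec`` only looks the module up; it does not import it.
    """
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Install these first:", ", ".join(missing))
else:
    print("All Formasaurus dependencies appear to be importable.")
```

Any module reported as missing should be installed by following that package's own instructions.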
93 changes: 90 additions & 3 deletions docs/usage.rst
@@ -1,6 +1,9 @@
Usage
=====

Basic Usage
-----------

Grab some HTML:

>>> import requests
@@ -19,15 +22,30 @@ to detect form and field types:
'user[password]': 'password'},
'form': 'registration'})]

It returns a list of (form, info) tuples, one tuple for each ``<form>``
element on a page. ``info`` dict contains form and field types.

.. note::

    To detect form and field types, Formasaurus needs to train prediction
    models on the user's machine. This is done automatically on the first call;
    models are saved to a file and then reused.

:func:`formasaurus.extract_forms <formasaurus.classifiers.extract_forms>`
returns a list of (form, info) tuples, one tuple for each ``<form>``
element on a page. ``form`` is an lxml Element for a form, and the
``info`` dict contains form and field types.
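
As a rough illustration of working with this return value, the sketch below walks a list of (form, info) tuples and picks out fields of one type. It assumes the ``info`` dict has a ``'form'`` key and a ``'fields'`` mapping of field name to field type, which is what the truncated output above suggests. ``fields_of_type`` is a hypothetical helper, and the sample data is made up, with ``None`` standing in for the lxml elements.

```python
def fields_of_type(results, field_type):
    """Return (form type, field name) pairs for fields of ``field_type``.

    ``results`` is a list of (form, info) tuples in the shape assumed above.
    """
    matches = []
    for _form, info in results:
        # "fields" maps field names to detected field types (assumed layout).
        for name, detected_type in info.get("fields", {}).items():
            if detected_type == field_type:
                matches.append((info["form"], name))
    return matches


# Made-up sample in the assumed result shape; None replaces lxml elements.
sample_results = [
    (None, {"form": "search", "fields": {"q": "search query"}}),
    (None, {"form": "registration",
            "fields": {"user[login]": "username",
                       "user[password]": "password"}}),
]

password_fields = fields_of_type(sample_results, "password")
print(password_fields)
```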

Only fields which

1. are visible to the user;
2. have a non-empty ``name`` attribute

are returned. Other fields usually should be either submitted as-is
(hidden fields) or not sent to the server at all (fields without
a ``name`` attribute).

There are edge cases like fields filled with JS or fields which are made
invisible using CSS, but all bets are off if a page uses JS heavily and all
we have is HTML source.
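
The filtering rule can be sketched with the standard library alone. This is only an illustration of the rule, not Formasaurus' implementation: it treats a field as hidden only when it has ``type="hidden"``, ignores CSS and JS entirely, and the class name ``NamedVisibleFields`` is hypothetical.

```python
from html.parser import HTMLParser


class NamedVisibleFields(HTMLParser):
    """Collect names of form fields that are not hidden and have a name."""

    FIELD_TAGS = {"input", "select", "textarea", "button"}

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.FIELD_TAGS:
            return
        attrs = dict(attrs)
        # Attribute values can be None (e.g. a bare ``name``), so normalize.
        name = attrs.get("name") or ""
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if name and not is_hidden:
            self.names.append(name)


html = """
<form>
  <input type="hidden" name="csrf" value="abc">
  <input type="text" name="login">
  <input type="password" name="password">
  <input type="text">
</form>
"""

parser = NamedVisibleFields()
parser.feed(html)
print(parser.names)
```

Here the hidden ``csrf`` field and the unnamed text input are skipped, which mirrors the two conditions listed above.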

By default, the info dict contains only the most likely form and field types.
To get probabilities, pass ``proba=True``:

@@ -77,3 +95,72 @@ In this example the data is loaded from a URL; of course, data may be
loaded from a local file or from an in-memory object, or you may already
have the tree loaded (e.g. with Scrapy).

Form Types
----------

By default, Formasaurus detects these form types:

* ``search``
* ``login``
* ``registration``
* ``password/login recovery``
* ``contact/comment``
* ``join mailing list``
* ``order/add to cart``
* all other forms are classified as ``other``.
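
Since form types are plain strings, a typo in a type name would silently match nothing. The sketch below is one possible way to guard against that when selecting a form from the results. ``first_form_of_type`` is a hypothetical helper, and the results are assumed to be (form, info) tuples whose ``info`` dict has a ``'form'`` key.

```python
# Form type names copied from the list above.
FORM_TYPES = {
    "search",
    "login",
    "registration",
    "password/login recovery",
    "contact/comment",
    "join mailing list",
    "order/add to cart",
    "other",
}


def first_form_of_type(results, form_type):
    """Return the first (form, info) tuple of ``form_type``, or None."""
    if form_type not in FORM_TYPES:
        raise ValueError("unknown form type: %r" % (form_type,))
    for form, info in results:
        if info["form"] == form_type:
            return form, info
    return None


# Made-up sample; None stands in for the lxml form elements.
sample_results = [
    (None, {"form": "search", "fields": {"q": "search query"}}),
    (None, {"form": "login", "fields": {"pw": "password"}}),
]

login = first_form_of_type(sample_results, "login")
print(login)
```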

Field Types
-----------

By default, Formasaurus detects these field types:

* ``username``
* ``password``
* ``password confirmation`` - "enter the same password again"
* ``email``
* ``email confirmation`` - "enter the same email again"
* ``username or email`` - a field where both username and email are accepted
* ``captcha`` - image captcha or a puzzle to solve
* ``honeypot`` - this field usually should be left blank
* ``TOS confirmation`` - "I agree with Terms of Service",
  "I agree to follow website rules", "It is OK to process my personal info", etc.
* ``receive emails confirmation`` - a checkbox which means
  "yes, it is ok to send me some sort of emails"
* ``remember me checkbox`` - common on login forms
* ``submit button`` - a button user should click to submit this form
* ``cancel button``
* ``reset/clear button``
* ``first name``
* ``last name``
* ``middle name``
* ``full name``
* ``organization name``
* ``gender``
* ``day``
* ``month``
* ``year``
* ``full date``
* ``time zone``
* ``DST`` - Daylight saving time preference
* ``country``
* ``city``
* ``state``
* ``address`` - other address information
* ``postal code``
* ``phone`` - phone number or its part
* ``fax``
* ``url``
* ``OpenID``
* ``about me text``
* ``comment text``
* ``comment title or subject``
* ``security question`` - "mother's maiden name"
* ``answer to security question``
* ``search query``
* ``search category / refinement`` - search parameter, filtering option
* ``product quantity``
* ``style select`` - style/theme select, common on forums
* ``sorting option`` - asc/desc order, items per page
* ``other number``
* ``other read-only`` - field with information user shouldn't change
* all other fields are classified as ``other``.
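
One way detected field types might be used is to turn a mapping of "type to value" into submission data keyed by the page's own field names. The sketch below assumes a ``fields`` dict of field name to field type. ``build_form_data`` and all field names and values are hypothetical.

```python
def build_form_data(fields, values_by_type):
    """Map field names to values chosen by each field's detected type.

    Fields whose type has no entry in ``values_by_type`` are left out.
    """
    data = {}
    for name, field_type in fields.items():
        if field_type in values_by_type:
            data[name] = values_by_type[field_type]
    return data


# Hypothetical detected fields of a registration form.
fields = {
    "user[login]": "username",
    "user[password]": "password",
    "user[password_confirmation]": "password confirmation",
    "website": "honeypot",
    "newsletter": "receive emails confirmation",
}

# Values keyed by field type; the honeypot is deliberately left blank.
values_by_type = {
    "username": "alice",
    "password": "s3cret",
    "password confirmation": "s3cret",
    "honeypot": "",
}

form_data = build_form_data(fields, values_by_type)
print(form_data)
```

Keying the values by field type rather than by field name keeps the same input usable across sites that name their fields differently.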
