## HTML features
Extracting DOM features from a dataset of Romanian e-commerce sites.

### Motivation
In this document we extract both per-page and per-tag features. These features will later be explored to see whether any hypothesis can be deduced intuitively from observing them. Regardless of the outcome, all of them will also be used to train models, which gives a second way to judge their significance.

I will focus on features that can be extracted from the DOM (the tree structure) alone, without using CSS or other styling information. Such features include the depth of each node in the tree, the tag type, the node's contents and, more generally, properties of the nodes in relation to the tree structure of the page.

Overall, I will extract several of these features and give, for each, a rough rationale for why it might be relevant to identifying content.

In [1]:
%matplotlib inline
# standard library
import itertools
import ast

# pandas
import pandas as pd

# numpy, matplotlib, seaborn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# lxml
from lxml import etree

# this styling is purely my preference
# less chartjunk
# apply the style first: sns.set() also resets the plotting context
sns.set(style='ticks', palette='Set2')
sns.set_context('notebook', font_scale=1.5, rc={'lines.linewidth': 2.5})

In [2]:
# load the data
df = pd.read_csv('../data/2017-07-21-13:41:49-ecommerce-ro.csv')
df.head()

Unnamed: 0,html,url
0,<!DOCTYPE html>\r\n<html>\r\n <head>\r\n <...,https://brx.ro/
1,<!DOCTYPE html>\n<html>\n <head>\n <link r...,https://brx.ro/adauga-anunt
2,<!DOCTYPE html>\n<html>\n <head>\n <link r...,https://brx.ro/anunturi/categorie/auto-moto-si...
3,<!DOCTYPE html>\n<html>\n <head>\n <link r...,https://brx.ro/anunturi/categorie/imobiliare
4,<!DOCTYPE html>\n<html>\n <head>\n <link r...,https://brx.ro/anunturi/categorie/electronice-...


In [3]:
# get a test series of nodes
text = df.html[0]
root = etree.HTML(text)
nodes = list(root.iter())

### Features
#### Depth
The depth of each HTML node in the HTML tree, with the root at depth 1. This might help decide roughly at what level of nesting the content resides.

In [4]:
def depth(node):
    d = 0
    while node is not None:
        d += 1
        node = node.getparent()
    return d


def extract_depths(nodes):
    """Returns a Series of the depths of the nodes"""
    return pd.Series(data=(depth(node) for node in nodes))


# test
print('DEPTH')
extract_depths(nodes).head()

DEPTH


0    1
1    2
2    3
3    3
4    3
dtype: int64
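
As a cross-check on small inputs, the same depths can be computed in a single top-down pass instead of walking up from every node. The sketch below is only an illustration: it uses the standard library's `xml.etree.ElementTree` on a hand-written, well-formed snippet rather than lxml on a real page, and `depths_top_down` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def depths_top_down(root):
    """Return (tag, depth) pairs in document order, with the root at depth 1."""
    result = []
    stack = [(root, 1)]
    while stack:
        node, d = stack.pop()
        result.append((node.tag, d))
        # push the children reversed so that they are popped in document order
        stack.extend((child, d + 1) for child in reversed(list(node)))
    return result


# hand-written stand-in for a page (an assumption, not taken from the dataset)
snippet = ("<html><head><title>t</title></head>"
           "<body><div><p>x</p></div></body></html>")
pairs = depths_top_down(ET.fromstring(snippet))
print(pairs)
# [('html', 1), ('head', 2), ('title', 3), ('body', 2), ('div', 3), ('p', 4)]
```

Each node is visited once, so the cost grows with the number of nodes rather than with the number of nodes times their depth.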

#### Children
Extracting the number of children of each node. The number of children tells us which nodes are leaves and which are not, and might also convey semantic information, such as whether the node is part of a *list-like* structure.

In [5]:
def extract_no_children(nodes):
    """Returns a Series of the number of children for each node"""
    # len(node) is the number of direct children; getchildren() is deprecated in lxml
    return pd.Series(data=(len(node) for node in nodes))

print('# OF CHILDREN\n')
print(extract_no_children(nodes).head())

# OF CHILDREN

0     2
1    30
2     0
3     0
4     0
dtype: int64
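
One possible way to make the *list-like* intuition measurable is to look at how uniform a node's children are. The sketch below is an assumption and not a feature extracted in this notebook: `dominant_child_tag_ratio` is a hypothetical helper, and it runs on hand-written snippets parsed with the standard library's `xml.etree.ElementTree` instead of lxml.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def dominant_child_tag_ratio(node):
    """Share of the children that carry the most common child tag (0.0 for leaves)."""
    children = list(node)
    if not children:
        return 0.0
    most_common_count = Counter(child.tag for child in children).most_common(1)[0][1]
    return most_common_count / len(children)


# hand-written examples (assumptions, not taken from the dataset)
menu = ET.fromstring("<ul><li>a</li><li>b</li><li>c</li></ul>")
mixed = ET.fromstring("<div><h1>t</h1><p>a</p><p>b</p><a>c</a></div>")

print(dominant_child_tag_ratio(menu))   # 1.0
print(dominant_child_tag_ratio(mixed))  # 0.5
```

A ratio close to 1 together with a high number of children would hint at a list, a menu or a product grid.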


#### Tag type
This one is self-explanatory. HTML provides many *content-aware* tags such as `<article>` or `<aside>` which, if used correctly, can offer predictive value for our models. Even standard tags such as `<ul>` might be useful indicators for a model.

In [6]:
def extract_tag_types(nodes):
    """Returns a Series of tag names; non-element nodes are labelled 'comment'"""
    return pd.Series(data=(node.tag if isinstance(node.tag, str) else 'comment'
                           for node in nodes))


# test
print('TAG TYPES')
extract_tag_types(nodes).head()

TAG TYPES


0     html
1     head
2     link
3     meta
4    title
dtype: object

#### Text
Knowing whether a tag has text directly indicates whether it contains human-readable content. Note that lxml's `node.text` only holds the text that precedes the node's first child.

In [7]:
def extract_has_text(nodes):
    """Returns whether each one of the nodes contains text."""
    def has_text(nodes):
        for node in nodes:
            if node.text is None:
                yield False
            else:
                yield bool(node.text.strip())
    return pd.Series(data=has_text(nodes))


# test
print('HAS TEXT')
extract_has_text(nodes).head()

HAS TEXT


0    False
1    False
2    False
3    False
4     True
dtype: bool
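
The `has_text` feature only looks at `node.text`, which in the ElementTree data model (shared by lxml) is the text before the first child. The sketch below illustrates the difference between `text`, `tail` and the full text of a subtree. It uses the standard library's `xml.etree.ElementTree` on a hand-written snippet, and `has_any_text` is a hypothetical variant, not a feature used in this notebook.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring("<p>Hello <b>world</b> again</p>")
b = p[0]

direct_text = p.text               # text before the first child only
tail_text = b.tail                 # text after </b> is stored on the child
full_text = "".join(p.itertext())  # all the text in the subtree

print(repr(direct_text))  # 'Hello '
print(repr(tail_text))    # ' again'
print(repr(full_text))    # 'Hello world again'


def has_any_text(node):
    """True if the node or any of its descendants holds non-whitespace text."""
    return bool("".join(node.itertext()).strip())


# a wrapper without direct text still contains readable content
wrapper = ET.fromstring("<div><span>hi</span></div>")
print(wrapper.text, has_any_text(wrapper))  # None True
```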

#### Classes, ids and attributes
Because classes and ids are specific to each website, in this document we only extract numerical features related to them. For each node, we extract the number of classes it has and whether it has an id. Pages written with the same frontend frameworks might have similar features for the same types of semantic content.

**NOTE:** We also keep the list of classes of each node as a separate column, so that page-wide features such as the number of unique classes and the total number of ids can be extracted later.

**TODO:** By parsing class and id names (and also text content like that from the previous section), i.e. removing punctuation, splitting on camelCase etc., we may be able to extract textual features from them, but this is beyond the scope of this notebook.
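
As a rough idea of what the parsing mentioned in the TODO could look like, here is a minimal sketch using only the `re` module. `tokenize_name` is a hypothetical helper, and the splitting rules (camelCase boundaries, ASCII letters and digits only) are assumptions rather than a settled design.

```python
import re


def tokenize_name(name):
    """Split a class or id name into lowercase word tokens."""
    # mark camelCase boundaries: "homeTextFilter" -> "home Text Filter"
    spaced = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', ' ', name)
    # anything that is not an ASCII letter or digit acts as a separator
    parts = re.split(r'[^A-Za-z0-9]+', spaced)
    return [part.lower() for part in parts if part]


print(tokenize_name('homeTextFilter'))  # ['home', 'text', 'filter']
print(tokenize_name('container_12'))    # ['container', '12']
print(tokenize_name('headerbtns'))      # ['headerbtns']
```

Names written without any separator, such as `headerbtns`, stay in one piece. Splitting those would need a dictionary or a statistical segmenter.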

In [8]:
def extract_classes(nodes):
    """Extract a Series of class lists for each node."""
    return pd.Series(data=(node.attrib.get('class', '').split() for node in nodes))


def extract_has_id(nodes):
    """Extract a Series of bools telling whether the component
    has an id attribute or not."""
    return pd.Series(data=('id' in node.attrib for node in nodes))


def extract_no_classes(nodes):
    """Extracts the number of classes for each node"""
    return pd.Series(data=(len(classes) for classes in extract_classes(nodes)))


print('CLASSES')
print(extract_classes(nodes[50:75]).head())

print('\nHAS ID')
print(extract_has_id(nodes[50:75]).head())

print('\n# OF CLASSES')
print(extract_no_classes(nodes[50:75]).head())


CLASSES
0                  []
1              [left]
2    [homeTextFilter]
3              [left]
4    [homeTextFilter]
dtype: object

HAS ID
0    False
1    False
2    False
3    False
4     True
dtype: bool

# OF CLASSES
0    0
1    1
2    1
3    1
4    1
dtype: int64


In [9]:
def extract_node_features(nodes):
    """Returns a dataframe of features from the nodes"""
    depth_features = extract_depths(nodes)  # depths
    tag_type_features = extract_tag_types(nodes)  # tag types
    no_classes_features = extract_no_classes(nodes)  # # of classes
    has_id_features = extract_has_id(nodes)  # has id 
    no_children_features = extract_no_children(nodes) # # of children
    has_text_features = extract_has_text(nodes)  # has text
    classes_features = extract_classes(nodes)
    
    # series of features, and their names
    series = [depth_features, tag_type_features, no_classes_features, 
              has_id_features, no_children_features, has_text_features,
              classes_features]
    columns = ['depth', 'tag', 'no_classes', 'has_id', 'no_children', 'has_text', 'classes']
    # dicts preserve insertion order, so the columns stay in the order above
    df_items = dict(zip(columns, series))

    return pd.DataFrame(df_items)
    
extract_node_features(nodes[30:35])

Unnamed: 0,depth,tag,no_classes,has_id,no_children,has_text,classes
0,3,script,0,False,0,False,[]
1,3,script,0,False,0,True,[]
2,2,body,1,False,9,False,[homepage]
3,3,header,1,False,1,False,[standardwidth]
4,4,div,1,True,1,False,[clearfix]


#### Ancestor and descendant features
Until now, the tag features have had no knowledge of descendants and ancestors. We can specify a height and a depth to traverse up and down from the given node and extract features from the nodes encountered. For ancestors we can extract the same features as for the tag itself (name, number of children, classes, etc.). We can later vary this height to increase or decrease the knowledge of surrounding nodes, experimenting with different values and comparing their performance.

**NOTE:** From the number of children of a node's parent we can also infer the number of siblings of that node.
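
The note above amounts to subtracting one from the parent's child count. A minimal sketch, assuming the parent's child count is available per node and is 0 for the root, which has no parent (`sibling_counts` is a hypothetical helper):

```python
def sibling_counts(parent_child_counts):
    """Number of siblings per node, given the child count of each node's parent."""
    # the root has no parent, so its count of 0 is clamped instead of giving -1
    return [max(count - 1, 0) for count in parent_child_counts]


# html (no parent), head (html has 2 children), then three children of head,
# which has 30 children in the first page of the dataset
print(sibling_counts([0, 2, 30, 30, 30]))  # [0, 1, 29, 29, 29]
```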

Descendants and siblings do not come in a fixed number on their respective levels, but we can add aggregate features for each level of depth:

* tag count for each tag type at that depth from the node
* average number of classes on that level
* **TODO:** add others
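
The two aggregates listed above could be computed level by level, as in the sketch below. It is only an illustration under assumptions: it uses the standard library's `xml.etree.ElementTree` on a hand-written snippet instead of lxml, and `descendant_level_features` and its output layout are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def descendant_level_features(node, depth):
    """For each level 1..depth below `node`, count the tags and average the
    number of classes per node on that level."""
    features = []
    level = [node]
    for _ in range(depth):
        # move one level down: all the children of the current level
        level = [child for parent in level for child in parent]
        tag_counts = Counter(child.tag for child in level)
        class_counts = [len(child.get('class', '').split()) for child in level]
        avg_classes = sum(class_counts) / len(level) if level else 0.0
        features.append({'tag_counts': dict(tag_counts),
                         'avg_no_classes': avg_classes})
    return features


# hand-written stand-in for a menu (an assumption, not taken from the dataset)
snippet = ('<ul class="menu main">'
           '<li class="item"><a>one</a></li>'
           '<li class="item active"><a>two</a><span>!</span></li>'
           '</ul>')
feats = descendant_level_features(ET.fromstring(snippet), 2)
print(feats[0])  # {'tag_counts': {'li': 2}, 'avg_no_classes': 1.5}
print(feats[1])  # {'tag_counts': {'a': 2, 'span': 1}, 'avg_no_classes': 0.0}
```

With lxml, comment nodes would also show up among the children and would need the same handling as in `extract_tag_types`.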

In [10]:
def get_ancestors(node, height):
    """Yields the ancestors of the node, nearest first, up to the given height"""
    current_node = node.getparent()
    current_height = 1
    while current_node is not None and current_height <= height:
        yield current_node
        current_node = current_node.getparent()
        current_height += 1  # increment the height
    
    
def extract_ancestor_features(nodes, height):
    """Extracts features from the ancestors of each node up to the given
    height. Missing ancestors are padded with placeholder values."""
    # preextract them for ease
    node_feats = extract_node_features(nodes)
    features = (feat[1].values 
                for feat in node_feats.iterrows())
    nodes_features = dict(zip(nodes, features))
    feature_names = node_feats.columns # the names of the features
    feature_dtypes = node_feats.dtypes  # the types
    
    # add a feature placeholder to pad elements that are high
    # enough in the tree not to have enough ancestors
    feature_rows = []
    empty_row = np.array([0, '', 0, False, 0, False, list()], dtype=object)
    for node in nodes:
        # traverse ancestors
        feature_list = [nodes_features[ancestor] 
                        for ancestor in get_ancestors(node, height)] 
        # pad with empty ancestor features
        feature_list.extend([empty_row] * (height - len(feature_list)))
        feature_rows.append(np.hstack(feature_list))

    # rename the columns
    column_names = []
    for i in range(1, height+1):
        for name in feature_names:
            column_names.append('ancestor{}_{}'.format(i, name))
            
    # reconvert them to the original types
    column_dtypes = dict(zip(column_names, feature_dtypes.tolist() * height))
    return pd.DataFrame(data=np.vstack(feature_rows), 
                        columns=column_names).astype(column_dtypes)

extract_ancestor_features(nodes, 3)[35:40]

Unnamed: 0,ancestor1_depth,ancestor1_tag,ancestor1_no_classes,ancestor1_has_id,ancestor1_no_children,ancestor1_has_text,ancestor1_classes,ancestor2_depth,ancestor2_tag,ancestor2_no_classes,...,ancestor2_no_children,ancestor2_has_text,ancestor2_classes,ancestor3_depth,ancestor3_tag,ancestor3_no_classes,ancestor3_has_id,ancestor3_no_children,ancestor3_has_text,ancestor3_classes
35,4,div,1,True,1,False,[clearfix],3,header,1,...,1,False,[standardwidth],2,body,1,False,9,False,[homepage]
36,5,div,2,False,4,False,"[container_12, headerbtns]",4,div,1,...,1,False,[clearfix],3,header,1,False,1,False,[standardwidth]
37,6,div,1,True,1,False,[grid_3],5,div,2,...,4,False,"[container_12, headerbtns]",4,div,1,True,1,False,[clearfix]
38,7,a,0,False,1,False,[],6,div,1,...,1,False,[grid_3],5,div,2,False,4,False,"[container_12, headerbtns]"
39,5,div,2,False,4,False,"[container_12, headerbtns]",4,div,1,...,1,False,[clearfix],3,header,1,False,1,False,[standardwidth]


In [11]:
def extract_features(df, height):
    """Returns a dataframe of the features for all the nodes,
    including the ancestor ones"""
    
    page_features_list = []
    
    for page in df.itertuples():
        # get the nodes and compute the features
        nodes = list(etree.HTML(page.html).iter())
        node_features = extract_node_features(nodes)
        ancestor_features = extract_ancestor_features(nodes, height)
    
        features = pd.concat([node_features, ancestor_features], axis='columns')
        features['url'] = page.url
    
        # append to the list
        page_features_list.append(features)
        
    return pd.concat(page_features_list, axis='rows', ignore_index=True)
 
    
# test
feat_df = extract_features(df, 3)

In [12]:
feat_df.head()

Unnamed: 0,depth,tag,no_classes,has_id,no_children,has_text,classes,ancestor1_depth,ancestor1_tag,ancestor1_no_classes,...,ancestor2_has_text,ancestor2_classes,ancestor3_depth,ancestor3_tag,ancestor3_no_classes,ancestor3_has_id,ancestor3_no_children,ancestor3_has_text,ancestor3_classes,url
0,1,html,0,False,2,False,[],0,,0,...,False,[],0,,0,False,0,False,[],https://brx.ro/
1,2,head,0,False,30,False,[],1,html,0,...,False,[],0,,0,False,0,False,[],https://brx.ro/
2,3,link,0,False,0,False,[],2,head,0,...,False,[],0,,0,False,0,False,[],https://brx.ro/
3,3,meta,0,False,0,False,[],2,head,0,...,False,[],0,,0,False,0,False,[],https://brx.ro/
4,3,title,0,False,0,True,[],2,head,0,...,False,[],0,,0,False,0,False,[],https://brx.ro/


In [13]:
# save it
feat_df.to_csv('../2017-07-21-13:41:49-ecommerce-ro-features.csv')
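
One caveat when saving to CSV: list-valued columns such as `classes` are written as their string representation, so they come back as plain strings when the file is read again. The sketch below shows the round trip with the standard library's `csv` module on an in-memory buffer (a stand-in for the real file) and restores the lists with `ast.literal_eval`. With pandas, the `converters` argument of `read_csv` can apply the same function to a column.

```python
import ast
import csv
import io

# write two rows with a list-valued column to an in-memory CSV
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['tag', 'classes'])
writer.writerow(['div', ['container_12', 'headerbtns']])
writer.writerow(['a', []])

# read them back: the lists are now strings
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw = rows[0]['classes']
print(repr(raw))  # "['container_12', 'headerbtns']"

# literal_eval only evaluates Python literals, so it is safe for this purpose
parsed = [ast.literal_eval(row['classes']) for row in rows]
print(parsed)  # [['container_12', 'headerbtns'], []]
```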