## HTML features
Extracting DOM features from a dataset of Romanian e-commerce sites. This notebook assumes that the dataset has already been downloaded.

### Motivation
In this notebook we will extract both per-page and per-tag features. These features will first be explored to see whether any hypothesis can be deduced intuitively from observing them. Regardless of what that exploration shows, all of them will later be used to train models, so that their significance can also be judged from the results.

I will focus on features that can be extracted from the DOM (the tree structure) alone, without using CSS or other styling information. Such features include the depth of each node in the tree, the tag type, the node's contents and, more generally, properties of a node in relation to the tree structure of the page.

Overall, I will extract several of these features and, for each, present a rough rationale for why it might be relevant to identifying content.

![label](imgs/labelvis.png)

In [1]:
%matplotlib inline
# standard library
import itertools
import ast

# pandas
import pandas as pd

# numpy, matplotlib, seaborn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# lxml
from lxml import etree

# this styling is purely my preference
# less chartjunk
sns.set_context('notebook', font_scale=1.5, rc={'line.linewidth': 2.5})
sns.set(style='ticks', palette='Set2')

We define the constants here. This way they can later be parametrized so that the notebook can be run automatically.

In [2]:
FIRST_RAW_FILENAME = '../data/raw/first-ecommerce.csv'
FIRST_DOM_FEATURES_FILENAME = '../data/interim/first-dom-features.csv'
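
As one possible way to parametrize these constants, the sketch below reads overrides from environment variables and falls back to the values above. The `load_paths` helper and the convention of using same-named environment variables are assumptions made for illustration; they are not part of the original notebook.

```python
import os

# defaults copied from the constants cell
DEFAULT_PATHS = {
    'FIRST_RAW_FILENAME': '../data/raw/first-ecommerce.csv',
    'FIRST_DOM_FEATURES_FILENAME': '../data/interim/first-dom-features.csv',
}


def load_paths(env=None):
    """Returns the paths, letting same-named environment variables
    (a hypothetical convention) override the defaults."""
    env = os.environ if env is None else env
    return {name: env.get(name, default)
            for name, default in DEFAULT_PATHS.items()}


paths = load_paths()
```

A scheduled run could then export `FIRST_RAW_FILENAME` before executing the notebook, while interactive runs keep the defaults.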

### Reading the data 

In [3]:
# load the data
df = pd.read_csv(FIRST_RAW_FILENAME)
df.head()

| | html | url |
|---|---|---|
| 0 | `<!DOCTYPE html><html lang="ro" class=""><head>...` | https://www.emag.ro/resigilate/placi_video/c?r... |
| 1 | `<!DOCTYPE html><html xml:lang="ro" lang="ro" c...` | https://www.emag.ro/resigilate/ventilatoare-pc... |
| 2 | `<!DOCTYPE html><html xmlns:og="http://ogp.me/n...` | https://www.olx.ro/auto-masini-moto-ambarcatiu... |
| 3 | `<!DOCTYPE html><html lang="ro" class=""><head>...` | https://www.emag.ro/resigilate |
| 4 | `<!DOCTYPE html><html xml:lang="ro" lang="ro" c...` | https://www.emag.ro/label/pret,intre-200-si-50... |


In [4]:
# get a test series of nodes
text = df.html[0]
root = etree.HTML(text)
nodes = list(root.iter())

### Features
![tree](imgs/dom.png)
#### Depth
The depth of each HTML node in the HTML tree. This might be relevant for deciding roughly at what level of nesting the content resides. Depth is counted from 1, so the root `<html>` element has depth 1.

In [5]:
def depth(node):
    d = 0
    while node is not None:
        d += 1
        node = node.getparent()
    return d


def extract_depths(nodes):
    """Returns a Series of the depths of the nodes"""
    return pd.Series(data=(depth(node) for node in nodes))


# test
print('DEPTH')
extract_depths(nodes).head()

DEPTH


0    1
1    2
2    3
3    3
4    3
dtype: int64

#### Children
Extracting the number of children of each node. The number of children tells us which nodes are leaves and which are not, and it might also convey semantic information, such as whether the node is an element of a *list-like* structure.

In [6]:
def extract_no_children(nodes):
    """Returns a Series of the number of children for each node"""
    # len() of an lxml element is its number of children;
    # getchildren() is deprecated in lxml
    return pd.Series(data=(len(node) for node in nodes))

print('# OF CHILDREN\n')
print(extract_no_children(nodes).head())

# OF CHILDREN

0     2
1    38
2     0
3     0
4     0
dtype: int64


#### Tag type
This one is self-explanatory. HTML provides many *content-aware* tags such as `<article>` or `<aside>` which, if used correctly, can offer predictive value for our models. Even standard tags such as `<ul>` might be useful indicators for a model.

In [7]:
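def extract_tag_types(nodes):
    """Returns a Series of the tag names of the nodes"""
    # comments and processing instructions have a callable instead of a
    # string as their tag; all of them are labelled 'comment'
    return pd.Series(data=(node.tag if isinstance(node.tag, str) else 'comment'
                           for node in nodes))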


# test
print('TAG TYPES')
extract_tag_types(nodes).head()

TAG TYPES


0     html
1     head
2     meta
3    title
4     meta
dtype: object

#### Text
Knowing whether a tag has text directly indicates whether it contains human-readable content. We will not, however, add it as a binary feature, but as the total number of characters for the given tag. Note that this count includes the text of all of the tag's descendants.

In [8]:
def extract_text_len(nodes):
    """Returns a Series of the number of characters of text for each node,
    counting the text of its descendants as well."""
    def get_text_len(nodes):
        for node in nodes:
            text = '' if node.tag is etree.Comment or node.tag is etree.PI else ''.join(node.itertext())
            yield len(text)
    return pd.Series(data=get_text_len(nodes))


# test
print('TEXT LEN')
extract_text_len(nodes).head()

TEXT LEN


0    139084
1      2379
2         0
3        40
4         0
dtype: int64

#### Classes, ids and attributes
Because classes and ids are specific to each website, in this notebook we will only extract numerical features related to them. For each node, we will extract the number of classes associated with it and whether it has an id. Pages written with the same frontend frameworks might have similar features for the same types of semantic content.

We will also extract the lengths of the **text** and of the **id** and **class** attributes. This provides some form of **shallow text features** without resorting to complex NLP. We do not count the number of classes in this category: it only requires splitting on whitespace and is a declarative feature of HTML itself, so it fits our category of **DOM features**.

**NOTE:** We will also keep the raw list of classes as a separate column, so that page-wide features such as the number of unique classes and the total number of ids can be extracted later.
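
The page-wide features mentioned in this note could be computed along the following lines. This is only a sketch: it assumes a per-node table with `url`, `classes` and `id_len` columns, the `page_wide_features` helper and its output column names are hypothetical, and the toy rows are invented for illustration.

```python
import pandas as pd


def page_wide_features(node_df):
    """Aggregates per-node class and id features into one row per page."""
    grouped = node_df.groupby('url')
    # number of distinct class names used anywhere on the page
    no_unique_classes = grouped['classes'].apply(
        lambda lists: len({name for names in lists for name in names}))
    # number of nodes that carry a non-empty id
    no_ids = grouped['id_len'].apply(lambda lens: int((lens > 0).sum()))
    return pd.DataFrame({'no_unique_classes': no_unique_classes,
                         'no_ids': no_ids})


# toy stand-in for the per-node features of two pages (invented values)
node_df = pd.DataFrame({
    'url': ['page-a', 'page-a', 'page-a', 'page-b', 'page-b'],
    'classes': [['megamenu', 'fixed'], ['megamenu'], [],
                ['item'], ['item', 'sale', 'new']],
    'id_len': [0, 4, 0, 7, 3],
})
page_df = page_wide_features(node_df)
```

The result is indexed by `url`, so it can be joined back onto the per-node table if every node should carry the features of its page.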

**TODO:** By parsing class and id names, and also text content like that from the previous section (i.e. removing punctuation, splitting on camelCase, etc.), we may be able to extract textual features from them, but this is beyond the scope of this notebook.
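
As a rough idea of what such parsing might look like, the sketch below splits a class or id name on punctuation and camelCase boundaries. The `tokenize_name` helper is hypothetical, and it only treats ASCII letters and digits as word characters, which is an assumption that may not hold for every site.

```python
import re


def tokenize_name(name):
    """Splits a class or id name into lowercase word tokens."""
    # insert a space at camelCase boundaries: 'listDepartment' -> 'list Department'
    spaced = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', ' ', name)
    # anything that is not an ASCII letter or digit acts as a separator
    return [token.lower()
            for token in re.split(r'[^A-Za-z0-9]+', spaced) if token]


tokens = tokenize_name('megamenu-list-department')
```

The resulting tokens could later feed a bag-of-words style representation, which is the kind of textual feature left out of scope here.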

In [9]:
def extract_classes(nodes):
    """Extract a Series of class lists for each node."""
    return pd.Series(data=(node.attrib.get('class', '').split() for node in nodes))


def extract_attr_len(nodes, attr_name='id'):
    """Extract a Series of the lengths of the given attribute for each
    node (0 when the node does not have the attribute)."""
    return pd.Series(data=(len(node.attrib.get(attr_name, '')) for node in nodes))


def extract_no_classes(nodes):
    """Extracts the number of classes for each node"""
    return pd.Series(data=(len(classes) for classes in extract_classes(nodes)))


print('CLASSES')
print(extract_classes(nodes[50:75]).head())

print('\nHAS ID')
print(extract_attr_len(nodes[50:75]).head())

print('\n# OF CLASSES')
print(extract_no_classes(nodes[50:75]).head())

CLASSES
0       [megamenu-container, megamenu-container-fixed]
1                                           [megamenu]
2                            [megamenu-list-container]
3                                      [megamenu-list]
4    [megamenu-list-department, js-megamenu-list-de...
dtype: object

HAS ID
0    0
1    0
2    0
3    0
4    0
dtype: int64

# OF CLASSES
0    2
1    1
2    1
3    1
4    2
dtype: int64


In [10]:
def extract_node_features(nodes):
    """Returns a dataframe of features from the nodes"""
    depth_features = extract_depths(nodes)  # depths
    tag_type_features = extract_tag_types(nodes)  # tag types
    no_classes_features = extract_no_classes(nodes)  # # of classes
    id_len_features = extract_attr_len(nodes)  # id len
    class_len_features = extract_attr_len(nodes, 'class')  # class len
    no_children_features = extract_no_children(nodes) # # of children
    text_len_features = extract_text_len(nodes)  # text length
    
    
    # series of features, and their names
    series = [depth_features, tag_type_features, no_classes_features, 
              id_len_features, class_len_features, no_children_features,
              text_len_features]
    columns = ['depth', 'tag', 'no_classes', 'id_len','class_len',
               'no_children', 'text_len']
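    # DataFrame.from_items was removed in pandas 1.0; a dict keeps its
    # insertion order, so the columns come out in the order listed above
    return pd.DataFrame(dict(zip(columns, series)))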
    
extract_node_features(nodes[30:35])

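| | depth | tag | no_classes | id_len | class_len | no_children | text_len |
|---|---|---|---|---|---|---|---|
| 0 | 3 | script | 0 | 0 | 0 | 0 | 0 |
| 1 | 3 | script | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | script | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | script | 0 | 0 | 0 | 0 | 0 |
| 4 | 3 | script | 0 | 0 | 0 | 0 | 1537 |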


#### Ancestor and descendant features
Until now, the features of a tag have carried no knowledge of its descendants and ancestors. We can specify a height and a depth to traverse up and down from a given node and extract features along the way. For ancestors, we can extract the same features as for the tag itself (name, number of children, classes, etc.). We can later vary this height to increase or decrease how much is known about preceding and following nodes, experimenting with different values and comparing their performance.

**NOTE:** From the number of children of a node's parent we can also infer the number of siblings of that node.
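
A minimal sketch of that inference follows. The child counts are taken from the output shown earlier (`<html>` has 2 children, `<head>` has 38), the column name `ancestor1_no_children` follows the naming used for ancestor features in this notebook, and the `no_siblings` column is a hypothetical addition.

```python
import pandas as pd

# the root has no parent, so its parent columns hold the zero padding
feats = pd.DataFrame({
    'tag': ['html', 'head', 'meta', 'title'],
    'ancestor1_no_children': [0, 2, 38, 38],
})

# a node's siblings are its parent's children minus the node itself;
# clip keeps the root (padded with 0) at 0 instead of -1
feats['no_siblings'] = (feats['ancestor1_no_children'] - 1).clip(lower=0)
```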

Descendants and siblings do not come in a fixed number on their respective levels. We can, however, add aggregate features at each level of depth:

* tag count for each tag type at that depth from the node
* average number of classes on that level
* **TODO:** add others
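
The aggregates listed above could be sketched as follows. To stay free of extra dependencies, the example uses the standard library's `xml.etree.ElementTree` on a small well-formed snippet instead of lxml; the `descendant_level_features` helper and its key names are hypothetical and this is not the implementation used later in the notebook.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def descendant_level_features(node, depth):
    """Returns one dict of aggregate features per descendant level,
    from the children of `node` down to `depth` levels below it."""
    features = []
    level = list(node)  # level 1: the direct children
    for _ in range(depth):
        no_nodes = len(level)
        # with lxml, comment nodes have a callable tag and would need
        # to be mapped to a string first
        tag_counts = Counter(child.tag for child in level)
        no_classes = [len(child.attrib.get('class', '').split())
                      for child in level]
        features.append({
            'no_nodes': no_nodes,
            'tag_counts': dict(tag_counts),
            'avg_no_classes': sum(no_classes) / no_nodes if no_nodes else 0.0,
        })
        # move one level down: the children of all nodes on this level
        level = [grandchild for child in level for grandchild in child]
    return features


snippet = ET.fromstring(
    '<ul class="products">'
    '<li class="item sale"><a>one</a><span>1</span></li>'
    '<li class="item"><a>two</a></li>'
    '</ul>'
)
levels = descendant_level_features(snippet, 3)
```

Levels that lie below the leaves simply report zero nodes, which plays the same role as the padding used for missing ancestors.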

In [11]:
def get_ancestors(node, height):
    """Yields the ancestors of the node, closest first, up to the given height"""
    current_node = node.getparent()
    current_height = 1
    while current_node is not None and current_height <= height:
        yield current_node
        current_node = current_node.getparent()
        current_height += 1  # increment the height
    
    
def extract_ancestor_features(nodes, height):
    """Extracts features from the ancestors of each node up to the given
    height. Nodes with fewer ancestors are padded with empty feature rows."""
    # preextract them for ease
    node_feats = extract_node_features(nodes)
    features = (feat[1].values 
                for feat in node_feats.iterrows())
    nodes_features = dict(zip(nodes, features))
    feature_names = node_feats.columns # the names of the features
    feature_dtypes = node_feats.dtypes  # the types
    
    # add a feature placeholder to pad elements that are high
    # enough in the tree not to have enough ancestors
    feature_rows = []
    empty_row = np.array([0, '', 0, 0, 0, 0, 0], dtype=object)
    for node in nodes:
        # traverse ancestors
        feature_list = [nodes_features[ancestor] 
                        for ancestor in get_ancestors(node, height)] 
        # pad with empty ancestor features
        feature_list.extend([empty_row] * (height - len(feature_list)))
        feature_rows.append(np.hstack(feature_list))

    # rename the columns
    column_names = []
    for i in range(1, height+1):
        for name in feature_names:
            column_names.append('ancestor{}_{}'.format(i, name))
            
    # reconvert them to the original types
    column_dtypes = dict(zip(column_names, feature_dtypes.tolist() * height))
    return pd.DataFrame(data=np.vstack(feature_rows), 
                        columns=column_names).astype(column_dtypes)

extract_ancestor_features(nodes, 3)[35:40]

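| | ancestor1_depth | ancestor1_tag | ancestor1_no_classes | ancestor1_id_len | ancestor1_class_len | ancestor1_no_children | ancestor1_text_len | ancestor2_depth | ancestor2_tag | ancestor2_no_classes | ... | ancestor2_class_len | ancestor2_no_children | ancestor2_text_len | ancestor3_depth | ancestor3_tag | ancestor3_no_classes | ancestor3_id_len | ancestor3_class_len | ancestor3_no_children | ancestor3_text_len |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 35 | 2 | head | 0 | 0 | 0 | 38 | 2379 | 1 | html | 0 | ... | 0 | 2 | 139084 | 0 |  | 0 | 0 | 0 | 0 | 0 |
| 36 | 2 | head | 0 | 0 | 0 | 38 | 2379 | 1 | html | 0 | ... | 0 | 2 | 139084 | 0 |  | 0 | 0 | 0 | 0 | 0 |
| 37 | 2 | head | 0 | 0 | 0 | 38 | 2379 | 1 | html | 0 | ... | 0 | 2 | 139084 | 0 |  | 0 | 0 | 0 | 0 | 0 |
| 38 | 2 | head | 0 | 0 | 0 | 38 | 2379 | 1 | html | 0 | ... | 0 | 2 | 139084 | 0 |  | 0 | 0 | 0 | 0 | 0 |
| 39 | 2 | head | 0 | 0 | 0 | 38 | 2379 | 1 | html | 0 | ... | 0 | 2 | 139084 | 0 |  | 0 | 0 | 0 | 0 | 0 |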


In [12]:
def extract_features(df, height):
    """Returns a dataframe of the features for all the nodes,
    including the ancestor ones"""
    
    page_features_list = []
    
    for page in df.itertuples():
        # get the nodes and compute the features
        nodes = list(etree.HTML(page.html).iter())
        node_features = extract_node_features(nodes)
        ancestor_features = extract_ancestor_features(nodes, height)
    
        features = pd.concat([node_features, ancestor_features], axis='columns')
        features['url'] = page.url
    
        # append to the list
        page_features_list.append(features)
        
    return pd.concat(page_features_list, axis='rows', ignore_index=True)
 
    
# test
feat_df = extract_features(df, 3)

In [13]:
feat_df.head()

| | depth | tag | no_classes | id_len | class_len | no_children | text_len | ancestor1_depth | ancestor1_tag | ancestor1_no_classes | ... | ancestor2_no_children | ancestor2_text_len | ancestor3_depth | ancestor3_tag | ancestor3_no_classes | ancestor3_id_len | ancestor3_class_len | ancestor3_no_children | ancestor3_text_len | url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | html | 0 | 0 | 0 | 2 | 139084 | 0 |  | 0 | ... | 0 | 0 | 0 |  | 0 | 0 | 0 | 0 | 0 | https://www.emag.ro/resigilate/placi_video/c?r... |
| 1 | 2 | head | 0 | 0 | 0 | 38 | 2379 | 1 | html | 0 | ... | 0 | 0 | 0 |  | 0 | 0 | 0 | 0 | 0 | https://www.emag.ro/resigilate/placi_video/c?r... |
| 2 | 3 | meta | 0 | 0 | 0 | 0 | 0 | 2 | head | 0 | ... | 2 | 139084 | 0 |  | 0 | 0 | 0 | 0 | 0 | https://www.emag.ro/resigilate/placi_video/c?r... |
| 3 | 3 | title | 0 | 0 | 0 | 0 | 40 | 2 | head | 0 | ... | 2 | 139084 | 0 |  | 0 | 0 | 0 | 0 | 0 | https://www.emag.ro/resigilate/placi_video/c?r... |
| 4 | 3 | meta | 0 | 0 | 0 | 0 | 0 | 2 | head | 0 | ... | 2 | 139084 | 0 |  | 0 | 0 | 0 | 0 | 0 | https://www.emag.ro/resigilate/placi_video/c?r... |


## Test
After extracting the above code into a standalone Python file, we can check that it is implemented correctly by running it on a few pages.

In [14]:
import sys  # append src to the path
import os
sys.path.append(os.path.join(os.getcwd(), "../src"))

from features import extract_features_from_df

In [15]:
# test it
extract_features_from_df(df[:2], 2, 2)

| | depth | sibling_pos | tag | no_classes | id_len | class_len | no_children | text_len | classes | descendant1_no_nodes | ... | ancestor2_sibling_pos | ancestor2_tag | ancestor2_no_classes | ancestor2_id_len | ancestor2_class_len | ancestor2_no_children | ancestor2_text_len | ancestor2_classes | path | url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | html | 0 | 0 | 0 | 2 | 139084 | [] | 2 | ... | 0 |  | 0 | 0 | 0 | 0 | 0 | [] | /html | https://www.emag.ro/resigilate/placi_video/c?r... |
| 1 | 2 | 0 | head | 0 | 0 | 0 | 38 | 2379 | [] | 38 | ... | 0 |  | 0 | 0 | 0 | 0 | 0 | [] | /html/head | https://www.emag.ro/resigilate/placi_video/c?r... |
| 2 | 3 | 0 | meta | 0 | 0 | 0 | 0 | 0 | [] | 0 | ... | 0 | html | 0 | 0 | 0 | 2 | 139084 | [] | /html/head/meta[1] | https://www.emag.ro/resigilate/placi_video/c?r... |
| 3 | 3 | 1 | title | 0 | 0 | 0 | 0 | 40 | [] | 0 | ... | 0 | html | 0 | 0 | 0 | 2 | 139084 | [] | /html/head/title | https://www.emag.ro/resigilate/placi_video/c?r... |
| 4 | 3 | 2 | meta | 0 | 0 | 0 | 0 | 0 | [] | 0 | ... | 0 | html | 0 | 0 | 0 | 2 | 139084 | [] | /html/head/meta[2] | https://www.emag.ro/resigilate/placi_video/c?r... |
| 5 | 3 | 3 | meta | 0 | 0 | 0 | 0 | 0 | [] | 0 | ... | 0 | html | 0 | 0 | 0 | 2 | 139084 | [] | /html/head/meta[3] | https://www.emag.ro/resigilate/placi_video/c?r... |
| 6 | 3 | 4 | meta | 0 | 0 | 0 | 0 | 0 | [] | 0 | ... | 0 | html | 0 | 0 | 0 | 2 | 139084 | [] | /html/head/meta[4] | https://www.emag.ro/resigilate/placi_video/c?r... |
| 7 | 3 | 5 | link | 0 | 0 | 0 | 0 | 0 | [] | 0 | ... | 0 | html | 0 | 0 | 0 | 2 | 139084 | [] | /html/head/link[1] | https://www.emag.ro/resigilate/placi_video/c?r... |
| 8 | 3 | 6 | meta | 0 | 0 | 0 | 0 | 0 | [] | 0 | ... | 0 | html | 0 | 0 | 0 | 2 | 139084 | [] | /html/head/meta[5] | https://www.emag.ro/resigilate/placi_video/c?r... |
| 9 | 3 | 7 | meta | 0 | 0 | 0 | 0 | 0 | [] | 0 | ... | 0 | html | 0 | 0 | 0 | 2 | 139084 | [] | /html/head/meta[6] | https://www.emag.ro/resigilate/placi_video/c?r... |
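
Beyond inspecting the output by eye, a few structural invariants could be asserted on the resulting table. The sketch below is one way to do that; the `check_feature_invariants` helper is hypothetical, it assumes the `ancestor1_*` column names used earlier in this notebook, and the sample rows are taken from the outputs shown in earlier cells.

```python
import pandas as pd


def check_feature_invariants(feat_df):
    """Raises AssertionError if a basic structural invariant is violated."""
    has_parent = feat_df['depth'] > 1
    with_parent = feat_df[has_parent]
    # depth is 1-based, so every node has depth >= 1
    assert (feat_df['depth'] >= 1).all()
    # the first ancestor is the parent, one level above the node
    assert (with_parent['ancestor1_depth'] == with_parent['depth'] - 1).all()
    # a parent has at least one child: the node itself
    assert (with_parent['ancestor1_no_children'] >= 1).all()
    # the root has no parent, so its ancestor columns hold the padding
    assert (feat_df.loc[~has_parent, 'ancestor1_tag'] == '').all()


# first rows of the first page, restricted to the columns that are checked
sample = pd.DataFrame({
    'depth': [1, 2, 3, 3],
    'tag': ['html', 'head', 'meta', 'title'],
    'ancestor1_depth': [0, 1, 2, 2],
    'ancestor1_tag': ['', 'html', 'head', 'head'],
    'ancestor1_no_children': [0, 2, 38, 38],
})
check_feature_invariants(sample)
```

Checks like these are cheap to run on every page and tend to catch misaligned rows between the node features and the ancestor features.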
