# Navigation Recognition

Given an element from the DOM tree of a webpage, the goal is to determine whether that element is the main navigation of the page. By main navigation we mean the single navigation that holds the links to the main sections or categories. Every page has a main navigation and can also contain multiple menus, context navigations and so on. We are trying to identify only the main navigation.

## Linear regression statistical model

We first calculate a probability for each element by applying weights to its DOM information. We then fit a linear model on features of the element: its accessibility role, HTML tag, class names, id and a few other components that we cover below.

In [18]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import mglearn
from IPython.display import display

%matplotlib notebook

import warnings
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")
warnings.filterwarnings(action="ignore", module="sklearn", message="^Objective did not")

In [19]:
df = pd.read_csv("linear-regression-input.csv", header=0)
df = df.drop_duplicates(subset=['url', 'tag', 'class', 'class_count', 'id', 'role', 'home', 'depth', 'links', 'links_by_depth'], keep='first')

Y = df['probability']
X = df[['tag', 'class', 'id', 'role', 'home', 'depth', 'links', 'links_by_depth']]

## Understand the model

The linear model of every element is based on several criteria:

### HTML tag

| tag      | weight | comment |
| -------- | ------ | ------- |
| `nav`    | **3**  | we are sure that this element holds some navigation |
| `div`    | **2**  | there are many containers that hold the navigation itself |
| `header` | **1**  | it is common to have navigation in the header |
| other    | **0**  | we are neutral here, we cannot tell |
| `footer` | **-1** | if it is a footer, we are not interested in it |
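
The tag table maps directly onto a lookup. The following is a minimal sketch, assuming tag names are compared case-insensitively; `TAG_WEIGHTS` and `tag_weight` are hypothetical names used for illustration and are not taken from the library.

```python
# Weights copied from the HTML tag table; any tag not listed is neutral (0).
TAG_WEIGHTS = {
    "nav": 3,
    "div": 2,
    "header": 1,
    "footer": -1,
}


def tag_weight(tag_name):
    """Return the weight of an HTML tag, or 0 for tags the table does not list."""
    # Assumption: tag names are normalized to lower case before the lookup.
    return TAG_WEIGHTS.get(tag_name.strip().lower(), 0)


print(tag_weight("nav"), tag_weight("section"), tag_weight("FOOTER"))
```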

### Element classes and ids

| weight | keywords            | comment |
|--------|---------------------|---------|
| **2**  | `navigation`, `nav` | if these keywords appear, we assume the element is a navigation or a navigation container |
| **1**  | `menu`, `main`      | we are not sure this is the main navigation, but it could be in some corner cases |
| **-1** | `footer`            | keyword indicating that we are certainly not interested in this element |
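
The keyword table can be applied to a `class` or `id` attribute roughly as follows. This is a sketch under several assumptions that the notebook does not state: the attribute is split into tokens on whitespace, hyphens and underscores, a token must match a keyword exactly, and `footer` takes precedence when several keywords are present. `KEYWORD_WEIGHTS` and `keyword_weight` are hypothetical names.

```python
import re

# Keyword groups from the table, ordered by the precedence assumed here:
# a footer keyword wins over navigation keywords, which win over menu/main.
KEYWORD_WEIGHTS = [
    (("footer",), -1),
    (("navigation", "nav"), 2),
    (("menu", "main"), 1),
]


def keyword_weight(attribute_value):
    """Score a class or id attribute value against the keyword table."""
    # Split e.g. "main-nav top_bar" into ["main", "nav", "top", "bar"].
    tokens = [t for t in re.split(r"[\s_\-]+", attribute_value.lower()) if t]
    for keywords, weight in KEYWORD_WEIGHTS:
        if any(token in keywords for token in tokens):
            return weight
    return 0  # no known keyword found


print(keyword_weight("main-nav"), keyword_weight("site-footer"))
```

Exact token matching is the conservative choice: substring matching would also catch names such as `navbar`, but it would wrongly score `canvas` as a navigation keyword.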

### Accessibility roles

| role                         | priority | comment |
|------------------------------|----------|---------|
| `navigation`                 | **high** | this is certainly the main navigation of the page |
| `menu`, `menubar`, `toolbar` | medium   | in most cases menus, context navigations, etc. |
| `contentinfo`                | low      | commonly used for the footer |

### Other
- `depth` - depth of the element in the DOM tree
- `home` - whether the element has links and at least one of them points to the home page (`/`)
- `links` - whether the element contains links
- `links_by_depth` - whether more than half of the links are at the same depth in the DOM tree
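
The three link-based flags can be derived from the links found inside an element. The sketch below assumes each link is available as an `(href, depth)` pair and that a home link is one whose `href` is exactly `/`; the function names and the input shape are hypothetical, chosen only to illustrate the definitions above.

```python
from collections import Counter


def links_flag(links):
    """1 if the element contains at least one link, else 0."""
    return 1 if links else 0


def home_flag(links):
    """1 if at least one link points to the home page ('/'), else 0."""
    return 1 if any(href == "/" for href, _depth in links) else 0


def links_by_depth_flag(links):
    """1 if more than half of the links share the same DOM depth, else 0."""
    if not links:
        return 0
    depth_counts = Counter(depth for _href, depth in links)
    _most_common_depth, count = depth_counts.most_common(1)[0]
    return 1 if count > len(links) / 2 else 0


# Illustrative input: four links, three of them at depth 5.
example_links = [("/", 5), ("/news", 5), ("/sport", 5), ("/imprint", 6)]
print(links_flag(example_links), home_flag(example_links), links_by_depth_flag(example_links))
```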

## Probability
For every element we calculate a probability by applying weights to the same criteria. We then use this probability as the target to measure how well the linear regression model performs.

In [20]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=0)  

model = LinearRegression()

model.fit(X_train, y_train) 

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

## Coefficients

Using our data set, let's examine the coefficient with which each criterion contributes. For example, you will see that the `depth` coefficient is negative, but if we train the model without it the score drops by 0.01%. This leads me to the conclusion that elements deeper in the DOM tree are less likely to be the navigation.

In [21]:
coeff_df = pd.DataFrame(model.coef_, X.columns, columns=['Coefficient'])  
coeff_df

|                  | Coefficient |
|------------------|-------------|
| tag              | 0.054247    |
| class            | 0.192798    |
| id               | 0.073258    |
| role             | 0.268792    |
| home             | 0.354699    |
| depth            | -0.001822   |
| links            | 0.073681    |
| links_by_depth   | 0.197675    |


## Score of the current model

In [22]:
model.score(X, Y)

0.9337631753170562

In [15]:
y_pred = model.predict(X_test) 
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})  
df

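|      | Actual | Predicted |
|------|--------|-----------|
| 2840 | 0.15   | 0.158407  |
| 311  | 0.40   | 0.433435  |
| 301  | 0.60   | 0.488262  |
| 18   | 0.40   | 0.419673  |
| 1925 | 0.15   | 0.154161  |
| 239  | 0.15   | 0.153481  |
| 1439 | 0.15   | 0.156538  |
| 2540 | 0.15   | 0.161127  |
| 1588 | 0.40   | 0.433435  |
| 1036 | 0.25   | 0.240996  |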


In [16]:
from sklearn import metrics  
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred))) 

Mean Absolute Error: 0.044615374107410344
Mean Squared Error: 0.006399412940528507
Root Mean Squared Error: 0.07999633079415897
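
To make the three metrics concrete, here is a plain-Python sketch of how they are defined. It uses only the ten rows displayed in the Actual/Predicted output above, so its numbers will differ from the figures reported for the full test set; the helper names are illustrative.

```python
import math

# The ten (actual, predicted) pairs shown in the notebook output above.
actual = [0.15, 0.40, 0.60, 0.40, 0.15, 0.15, 0.15, 0.15, 0.40, 0.25]
predicted = [0.158407, 0.433435, 0.488262, 0.419673, 0.154161,
             0.153481, 0.156538, 0.161127, 0.433435, 0.240996]


def mae(y_true, y_pred):
    """Mean absolute error: average size of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error: average squared error, penalizes large misses."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: square root of MSE, in the target's units."""
    return math.sqrt(mse(y_true, y_pred))


print("MAE: ", mae(actual, predicted))
print("MSE: ", mse(actual, predicted))
print("RMSE:", rmse(actual, predicted))
```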


## Testing

On https://www.rheinische-anzeigenblaetter.de/ our library successfully recognized the page header, footer and navigation, and assigned each of them the following internal probability of being the main navigation:

- `navigation` (1.4) with model `"3,2,0,0,1,4,1,1"`
- `header` (0.6) with model `"1,0,0,0,1,3,1,1"`
- `footer` (0.25) with model `"-1,-1,0,0,0,3,1,1"`

Now let's check how our linear regression scores these models:

In [17]:
array = np.array([[3,2,0,0,1,4,1,1], [1,0,0,0,1,3,1,1], [-1,-1,0,0,0,3,1,1]])
model.predict(array)

ValueError: shapes (3,8) and (9,) not aligned: 8 (dim 1) != 9 (dim 0)
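
The traceback says that the fitted model in memory held 9 coefficients while each input row has 8 values, so this cell ran against a model trained on a different feature set than the 8 columns listed earlier (possibly one that also included `class_count`, which appears in the de-duplication subset but not in `X`; this is a guess). As a cross-check that does not depend on the kernel state, the linear part of the prediction can be evaluated by hand from the coefficients printed in the Coefficients section. This is only a sketch: the intercept is not shown in the notebook, so the values below are offset by an unknown constant and only their ordering is meaningful.

```python
# Coefficients copied from the Coefficients output, in feature-column order.
COEFFICIENTS = {
    "tag": 0.054247,
    "class": 0.192798,
    "id": 0.073258,
    "role": 0.268792,
    "home": 0.354699,
    "depth": -0.001822,
    "links": 0.073681,
    "links_by_depth": 0.197675,
}


def linear_part(features):
    """Weighted sum of the features, excluding the (unknown) intercept."""
    if len(features) != len(COEFFICIENTS):
        raise ValueError(
            f"expected {len(COEFFICIENTS)} features, got {len(features)}"
        )
    return sum(w * x for w, x in zip(COEFFICIENTS.values(), features))


# The three element models reported by the library for the test page.
candidates = {
    "navigation": [3, 2, 0, 0, 1, 4, 1, 1],
    "header": [1, 0, 0, 0, 1, 3, 1, 1],
    "footer": [-1, -1, 0, 0, 0, 3, 1, 1],
}

scores = {name: linear_part(vector) for name, vector in candidates.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

With the printed coefficients the ranking is navigation, header, footer, the same order as the library's internal probabilities. Checking the feature count before calling `predict`, as `linear_part` does, also surfaces a stale model earlier than the matrix-shape error does.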

## Results

As we can see, the linear regression model predicts quite well on the randomly selected test data. From the results we can assume that the navigation is inside the header, which our library also confirms.