# Result interpretation
In the previous notebook we ran our experiments on our datasets using a random forest ensemble classifier, but we did not explore the results. In this one we take a closer look at the results for each individual pair of datasets (5 in total) and see what they mean.

## Experimental design
The full experimental design is described in the [main notes](../notes/Semantic-Data-From-Websites-Using-Deep-Learning.md) document, but here is a short description. The dataset contains labeled pages from 7 different e-commerce websites. Each page is split into HTML tags, which are grouped into 8 classes denoting their semantic relevance:
* list_image
* list_price
* list_title
* detail_image
* detail_title
* detail_description
* detail_price
* noise

`noise` is found on all pages, but the other classes are not always found on every page of a website. Therefore, the dataset was split into subsets, per website and by whether or not the pages contain one of the classes mentioned above. This gives three levels of subsets:
1. set of datasets containing the pages of one website that contain at least one tag of a certain class
2. set of datasets containing all pages of one website
3. the entire dataset

**NOTE:** For each website we have at least 10 pages containing at least one of the tags and 10 pages containing only `noise`.
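
The three levels can be sketched on a small tag-level table. This is only an illustration of the splitting rule described above: the column names (`website`, `page`, `label`), the example values and the helper functions are assumptions, not the code used to build the actual datasets.

```python
import pandas as pd

# Hypothetical tag-level table: one row per HTML tag.
# Column names and values are assumptions made for this sketch.
tags = pd.DataFrame({
    "website": ["a.example", "a.example", "a.example", "b.example", "b.example"],
    "page": ["a/1", "a/1", "a/2", "b/1", "b/1"],
    "label": ["detail_price", "noise", "noise", "list_title", "noise"],
})


def pages_with_label(df, website, label):
    """Level 1: all tags from pages of `website` that contain at least one `label` tag."""
    site = df[df["website"] == website]
    keep = site.loc[site["label"] == label, "page"].unique()
    return site[site["page"].isin(keep)]


def whole_website(df, website):
    """Level 2: all tags from every page of `website`."""
    return df[df["website"] == website]


level_1 = pages_with_label(tags, "a.example", "detail_price")
level_2 = whole_website(tags, "a.example")
level_3 = tags  # Level 3: the entire dataset
```

Note that a level 1 subset keeps the `noise` tags of the selected pages, since the selection is made per page and not per tag.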

Due to the relative invariance of tags within a website (shown in [7-dom-model](./7-dom-model)), we risk the model having already seen the data it is tested on. Therefore, after training on one of the subsets, we test not only on that subset but also on a subset one level higher (a model trained on level 1 is tested on both level 1 and level 2). This way we can check both the model's generalization power and whether the sought classes follow a similar distribution within a single website and across multiple websites.

In [1]:
%matplotlib inline

# standard library
import itertools
import sys, os
import re
import glob
import logging

from urllib.parse import urlparse

# pandas
import pandas as pd
import dask.dataframe as dd

# numpy, matplotlib, seaborn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

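# this styling is purely my preference
# less chartjunk
# one call, so that setting the style does not reset the context;
# the matplotlib rc key is 'lines.linewidth'
sns.set(context='notebook', style='ticks', palette='Set2',
        font_scale=1.5, rc={'lines.linewidth': 2.5})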

## Loading the data

In [4]:
df = pd.read_csv('../data/experimental-results/first-experiments.csv', index_col=0)
df.head()

Unnamed: 0,precision,recall,f1-score,support,label,train_website,test_website,train_pages_label,test_pages_label,experiment
0,1.0,1.0,1.0,2.0,detail_description_label,www.emag.ro,www.emag.ro,detail_description_label,detail_description_label,first-random-forest
1,1.0,1.0,1.0,4.0,detail_image_label,www.emag.ro,www.emag.ro,detail_description_label,detail_description_label,first-random-forest
2,1.0,1.0,1.0,2.0,detail_price_label,www.emag.ro,www.emag.ro,detail_description_label,detail_description_label,first-random-forest
3,1.0,1.0,1.0,2.0,detail_title_label,www.emag.ro,www.emag.ro,detail_description_label,detail_description_label,first-random-forest
4,0.0,0.0,0.0,0.0,list_image_label,www.emag.ro,www.emag.ro,detail_description_label,detail_description_label,first-random-forest


## Interpretation
Now that we have the data, we can look at the overall mean performance for each experiment and see where the model performed worst. The table contains the classification report for each experiment. `train_website` and `test_website` indicate the websites of the train and test datasets, and `train_pages_label` and `test_pages_label` indicate the classes contained in each dataset. `label` indicates the class that the classification results refer to.
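
As a quick overview, the per-experiment means can be computed in one step with a `groupby`. The sketch below uses a handful of rows copied from the outputs in this notebook rather than the full results file, so the numbers it produces are illustrative only.

```python
import pandas as pd

# A few rows copied from the outputs in this notebook (not the full results).
sample = pd.DataFrame({
    "experiment": ["first-random-forest", "first-random-forest",
                   "fourth-random-forest", "fourth-random-forest"],
    "label": ["detail_description_label", "detail_image_label",
              "detail_description_label", "detail_image_label"],
    "precision": [1.0, 1.0, 1.0, 0.542636],
    "recall": [1.0, 1.0, 0.288136, 0.47619],
    "f1-score": [1.0, 1.0, 0.447368, 0.507246],
})

# Unweighted mean of each metric, one row per experiment.
summary = sample.groupby("experiment")[["precision", "recall", "f1-score"]].mean()
print(summary)
```

The same call on the full `df` would give one summary row per experiment, although the sections below first filter each experiment to the relevant rows before averaging.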

### Train/Test on website subset 

In [8]:
# select only the relevant experiment
first_experiment_df = df[df['experiment'] == 'first-random-forest']
first_experiment_df.head()

Unnamed: 0,precision,recall,f1-score,support,label,train_website,test_website,train_pages_label,test_pages_label,experiment
0,1.0,1.0,1.0,2.0,detail_description_label,www.emag.ro,www.emag.ro,detail_description_label,detail_description_label,first-random-forest
1,1.0,1.0,1.0,4.0,detail_image_label,www.emag.ro,www.emag.ro,detail_description_label,detail_description_label,first-random-forest
2,1.0,1.0,1.0,2.0,detail_price_label,www.emag.ro,www.emag.ro,detail_description_label,detail_description_label,first-random-forest
3,1.0,1.0,1.0,2.0,detail_title_label,www.emag.ro,www.emag.ro,detail_description_label,detail_description_label,first-random-forest
4,0.0,0.0,0.0,0.0,list_image_label,www.emag.ro,www.emag.ro,detail_description_label,detail_description_label,first-random-forest


These are fine, but because we train and test on a subset of pages that may or may not contain some of the classes, we only keep the prediction results for the classes the model was definitely tested on.

In [9]:
# keep only the rows for the class the pages were selected on
first_experiment_df = first_experiment_df.query('label == train_pages_label')
first_experiment_df.head()

Unnamed: 0,precision,recall,f1-score,support,label,train_website,test_website,train_pages_label,test_pages_label,experiment
0,1.0,1.0,1.0,2.0,detail_description_label,www.emag.ro,www.emag.ro,detail_description_label,detail_description_label,first-random-forest
8,1.0,1.0,1.0,4.0,detail_description_label,lajumate.ro,lajumate.ro,detail_description_label,detail_description_label,first-random-forest
17,1.0,1.0,1.0,21.0,detail_image_label,lajumate.ro,lajumate.ro,detail_image_label,detail_image_label,first-random-forest
26,1.0,1.0,1.0,3.0,detail_price_label,lajumate.ro,lajumate.ro,detail_price_label,detail_price_label,first-random-forest
35,1.0,1.0,1.0,3.0,detail_title_label,lajumate.ro,lajumate.ro,detail_title_label,detail_title_label,first-random-forest


In [10]:
# mean
first_experiment_df.mean(numeric_only=True)  # recent pandas versions raise on the string columns otherwise

precision     1.000000
recall        0.974180
f1-score      0.981961
support      81.076923
dtype: float64

The results are very high, as expected. However, this is probably because, as seen in the exploratory data analysis, the variation of each class is negligible within a website, so the model has probably already seen all the possible values for that class during training.
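
One way to probe this suspicion is to measure how many test rows have exactly the same feature values as some training row. The following is a minimal sketch under assumptions: the feature names (`depth`, `n_children`) and the toy values are hypothetical, and the real feature set is the one described in the earlier notebooks.

```python
import pandas as pd


def seen_in_training(train, test, feature_cols):
    """Return the fraction of test rows whose feature values also occur in train."""
    train_keys = train[feature_cols].drop_duplicates()
    merged = test[feature_cols].merge(
        train_keys, on=feature_cols, how="left", indicator=True
    )
    # "_merge" is "both" when the test row matched a training row.
    return (merged["_merge"] == "both").mean()


# Hypothetical feature columns and values, for illustration only.
train = pd.DataFrame({"depth": [3, 3, 5, 7], "n_children": [0, 0, 2, 1]})
test = pd.DataFrame({"depth": [3, 5, 7, 9], "n_children": [0, 2, 1, 4]})

overlap = seen_in_training(train, test, ["depth", "n_children"])
print(f"{overlap:.0%} of test rows were already seen in training")
```

An overlap close to 100% would support the explanation above, while a low overlap would suggest that the model generalizes rather than memorizes.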

In [12]:
first_experiment_df.sort_values('f1-score').head()

Unnamed: 0,precision,recall,f1-score,support,label,train_website,test_website,train_pages_label,test_pages_label,experiment
200,1.0,0.333333,0.5,3.0,detail_description_label,www.okazii.ro,www.okazii.ro,detail_description_label,detail_description_label,first-random-forest
123,1.0,0.666667,0.8,3.0,detail_title_label,www.amazon.com,www.amazon.com,detail_title_label,detail_title_label,first-random-forest
189,1.0,0.993007,0.996491,143.0,list_price_label,www.emag.ro,www.emag.ro,list_price_label,list_price_label,first-random-forest
0,1.0,1.0,1.0,2.0,detail_description_label,www.emag.ro,www.emag.ro,detail_description_label,detail_description_label,first-random-forest
180,1.0,1.0,1.0,88.0,list_image_label,www.emag.ro,www.emag.ro,list_image_label,list_image_label,first-random-forest


The worst results are for the detail description label on the okazii website. If we look at the boxplots from the last notebook, we can confirm that this particular combination of class and website has the greatest variance in its features (the longest IQRs in the picture).
![boxplot](./imgs/boxplot.png)
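
To make the okazii row concrete: with a support of 3, a recall of 1/3 and a precision of 1.0, the model must have found one of the three description tags and produced no false positives. The small helper below (a hypothetical function written for this sketch, not part of the experiment code) recomputes the three metrics from those counts.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Counts implied by the okazii row: 1 tag found, 2 missed, no false positives.
p, r, f1 = precision_recall_f1(tp=1, fp=0, fn=2)
print(p, round(r, 6), round(f1, 6))
```

Because the support is so small, a single missed tag moves the recall by a third, so this row says more about the size of the test subset than about the model.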

### Train on website subset. Test on whole website

In [13]:
# select only the relevant experiment
second_experiment_df = df[df['experiment'] == 'second-random-forest']
second_experiment_df.head()

Unnamed: 0,precision,recall,f1-score,support,label,train_website,test_website,train_pages_label,test_pages_label,experiment
312,1.0,1.0,1.0,7.0,detail_description_label,www.emag.ro,www.emag.ro,detail_description_label,all,second-random-forest
313,1.0,1.0,1.0,18.0,detail_image_label,www.emag.ro,www.emag.ro,detail_description_label,all,second-random-forest
314,1.0,1.0,1.0,9.0,detail_price_label,www.emag.ro,www.emag.ro,detail_description_label,all,second-random-forest
315,1.0,1.0,1.0,10.0,detail_title_label,www.emag.ro,www.emag.ro,detail_description_label,all,second-random-forest
316,0.0,0.0,0.0,295.0,list_image_label,www.emag.ro,www.emag.ro,detail_description_label,all,second-random-forest


In [14]:
# keep only the rows for the class the training pages were selected on
second_experiment_df = second_experiment_df.query('label == train_pages_label')
second_experiment_df.mean(numeric_only=True)

precision      0.959145
recall         0.996485
f1-score       0.972790
support      293.461538
dtype: float64

The overall results are just as high, meaning that the model can learn the structure of the desired classes from just the pages they appear on. The noise is probably inferred through exclusion. This suggests that in-site generalization is feasible and could provide a good tool for web scraping.

Future experimental work could explore the minimum number of pages the model needs for training in order to generalize to the entire website. This could prove useful in reducing the size of future datasets, which should contain a larger number of websites.
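
Such a search could be organized as a simple loop over growing page budgets. The sketch below is only an outline under assumptions: `evaluate` is a hypothetical callable standing in for "train on these pages and return the score on the whole website", and the toy scoring function at the end exists purely to make the example runnable.

```python
import random


def page_budget_curve(pages, sizes, evaluate, seed=0):
    """Score nested random subsets of `pages` of growing size."""
    rng = random.Random(seed)
    shuffled = list(pages)
    rng.shuffle(shuffled)
    results = {}
    for size in sorted(sizes):
        if size > len(shuffled):
            break  # not enough pages for this budget
        results[size] = evaluate(shuffled[:size])
    return results


def smallest_sufficient(results, threshold):
    """Return the smallest budget whose score reaches `threshold`, or None."""
    for size in sorted(results):
        if results[size] >= threshold:
            return size
    return None


# Toy stand-in for training and scoring a model; NOT a real experiment.
def toy_evaluate(subset):
    return min(1.0, len(subset) / 10)


pages = [f"page-{i}" for i in range(12)]
curve = page_budget_curve(pages, sizes=(2, 5, 10, 20), evaluate=toy_evaluate)
print(curve, smallest_sufficient(curve, threshold=0.95))
```

Using nested subsets (each budget extends the previous one) keeps the curve comparable across sizes; repeating the loop with several seeds would give an idea of its variance.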

**NOTE:** This is *better than XPath* performance

### Train/test on whole website

In [16]:
third_experiment_df = df[df['experiment'] == 'third-random-forest']
third_experiment_df.head()

Unnamed: 0,precision,recall,f1-score,support,label,train_website,test_website,train_pages_label,test_pages_label,experiment
624,1.0,1.0,1.0,5.0,detail_description_label,lajumate.ro,lajumate.ro,all,all,third-random-forest
625,1.0,1.0,1.0,21.0,detail_image_label,lajumate.ro,lajumate.ro,all,all,third-random-forest
626,1.0,1.0,1.0,4.0,detail_price_label,lajumate.ro,lajumate.ro,all,all,third-random-forest
627,1.0,1.0,1.0,4.0,detail_title_label,lajumate.ro,lajumate.ro,all,all,third-random-forest
628,0.0,0.0,0.0,0.0,list_image_label,lajumate.ro,lajumate.ro,all,all,third-random-forest


For this experiment all classes could potentially be learned by the model. However, some websites do not contain some of the classes, mainly due to them not being rendered by JavaScript. We only select the rows with a support > 0.

In [17]:
third_experiment_df.query('support > 0').mean(numeric_only=True)

precision       0.995650
recall          0.992272
f1-score        0.993293
support      4461.217391
dtype: float64

The precision is even higher because the model has now not only seen enough of the semantic classes, but probably enough of `noise` as well.

### Train on whole website. Test on all of them

In [18]:
fourth_experiment_df = df[df['experiment'] == 'fourth-random-forest']
fourth_experiment_df.head()

Unnamed: 0,precision,recall,f1-score,support,label,train_website,test_website,train_pages_label,test_pages_label,experiment
680,1.0,0.288136,0.447368,59.0,detail_description_label,lajumate.ro,all,all,all,fourth-random-forest
681,0.542636,0.47619,0.507246,147.0,detail_image_label,lajumate.ro,all,all,all,fourth-random-forest
682,1.0,0.241935,0.38961,62.0,detail_price_label,lajumate.ro,all,all,all,fourth-random-forest
683,1.0,0.186667,0.314607,75.0,detail_title_label,lajumate.ro,all,all,all,fourth-random-forest
684,0.0,0.0,0.0,2901.0,list_image_label,lajumate.ro,all,all,all,fourth-random-forest


In [19]:
fourth_experiment_df.mean(numeric_only=True)

precision        0.807682
recall           0.249689
f1-score         0.322998
support      85505.000000
dtype: float64

Relatively high precision (0.81), but low recall (0.25). This is probably because the model predicts the classes of its training website very accurately, but fails to recognize them on any other website. A better model probably would not improve generalization much; what would help is having data from more websites, so that patterns shared by tags with similar semantic roles can emerge.
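
The mean above gives every row the same weight, whatever its support. As a complementary view, a support-weighted mean shows how much the large classes dominate. The sketch below uses only the five rows displayed by `head()` for this experiment, so it does not reproduce the overall figures; it just contrasts the two ways of averaging.

```python
import numpy as np
import pandas as pd

# The five rows shown in the head() output of the fourth experiment.
rows = pd.DataFrame({
    "label": ["detail_description_label", "detail_image_label",
              "detail_price_label", "detail_title_label", "list_image_label"],
    "precision": [1.0, 0.542636, 1.0, 1.0, 0.0],
    "recall": [0.288136, 0.47619, 0.241935, 0.186667, 0.0],
    "support": [59.0, 147.0, 62.0, 75.0, 2901.0],
})

metrics = ["precision", "recall"]

# Every row counts equally.
unweighted = rows[metrics].mean()

# Every row counts in proportion to its number of test samples.
weighted = pd.Series(
    np.average(rows[metrics].to_numpy(), axis=0, weights=rows["support"].to_numpy()),
    index=metrics,
)

print(pd.DataFrame({"unweighted": unweighted, "weighted": weighted}))
```

On these rows the weighted values are much lower, because `list_image_label`, which the model trained on lajumate.ro never predicts correctly, accounts for most of the support.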

As it is now, due to the low variance of the data within a website, the dataset is far too small to accurately represent the distribution of *semantic* data with these features.

## Conclusion
DOM data appears to be powerful enough to identify semantic data within the same website, and therefore has practical applications. Further experimentation, before expanding the dataset and adding visual features, will include finding the **generalization threshold** mentioned above and exploring how many DOM features are actually necessary for accurate prediction (they are described in past notebooks). The dimensionality may well be reducible: there might be Markov chain-like patterns that make the neighbourhood features redundant, but this is just a hypothesis.

Overall, the results are as expected: they show that an ML DOM-based model can identify semantic data within a website more accurately and more easily than tailor-made XPaths.