# DOM features experiment. Continuation
We will pick up where we left off last time. We will redo the experiments and look at the results, but this time we will also try to classify the entire dataset, both after training on a single website and after training on all of them.

In [1]:
%matplotlib inline

# standard library
import itertools
import sys, os
import re
import glob
import logging

from urllib.parse import urlparse

# pandas
import pandas as pd
import dask.dataframe as dd

# numpy, matplotlib, seaborn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


# scikit-learn
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, precision_recall_fscore_support
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier

# local imports
sys.path.append(os.path.join(os.getcwd(), "../src"))
from utils import get_domain_from_url
from experiments import simple_model_experiment, get_dataset_descr_from_filename, rf_eval

# this styling is purely my preference
# less chartjunk
# sns.set() resets the plotting context, so it has to run before set_context()
sns.set(style='ticks', palette='Set2')
sns.set_context('notebook', font_scale=1.5, rc={'lines.linewidth': 2.5})



Now that we have the scaffolding for the experiments, we can define our five experiments on top of it, for a given model. Every experiment is expressed in terms of the `simple_model_experiment` function; for each one we will build a description of its train and test datasets.

In [2]:
label_cols = ['detail_description_label', 'detail_image_label', 'detail_price_label',
              'detail_title_label', 'list_image_label', 'list_price_label', 'list_title_label']

For the first experiment, we will only use the CSVs holding the pages of a website that contain a given label, for both training and testing.

In [3]:
# describe the experiments
train_dataset_files = glob.glob('../data/ecommerce-new/final/split-label/*.csv')
train_datasets = [(file, ) + get_dataset_descr_from_filename(file) for file in train_dataset_files]
train_file_df = pd.DataFrame(data=train_datasets, columns=('file', 'website', 'label'))

# merge it with self 
experiments_df = train_file_df.merge(train_file_df, left_index=True, right_index=True, suffixes=('_train', '_test'))
experiments_df.head()  # inspect the experiments

Unnamed: 0,file_train,website_train,label_train,file_test,website_test,label_test
0,../data/ecommerce-new/final/split-label/www.em...,www.emag.ro,detail_description_label,../data/ecommerce-new/final/split-label/www.em...,www.emag.ro,detail_description_label
1,../data/ecommerce-new/final/split-label/lajuma...,lajumate.ro,detail_description_label,../data/ecommerce-new/final/split-label/lajuma...,lajumate.ro,detail_description_label
2,../data/ecommerce-new/final/split-label/lajuma...,lajumate.ro,detail_image_label,../data/ecommerce-new/final/split-label/lajuma...,lajumate.ro,detail_image_label
3,../data/ecommerce-new/final/split-label/lajuma...,lajumate.ro,detail_price_label,../data/ecommerce-new/final/split-label/lajuma...,lajumate.ro,detail_price_label
4,../data/ecommerce-new/final/split-label/lajuma...,lajumate.ro,detail_title_label,../data/ecommerce-new/final/split-label/lajuma...,lajumate.ro,detail_title_label
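
The self-merge above can be tried on a toy frame. This is a minimal sketch: the file and website names below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for train_file_df; all values here are hypothetical.
files = pd.DataFrame({
    'file': ['a.csv', 'b.csv'],
    'website': ['site-a.example', 'site-b.example'],
    'label': ['detail_title_label', 'list_price_label'],
})

# Merging a frame with itself on the index pairs every row with itself,
# so each experiment trains and tests on the same file. The suffixes
# tell the two copies of every column apart.
pairs = files.merge(files, left_index=True, right_index=True,
                    suffixes=('_train', '_test'))

print(pairs.columns.tolist())
```

Because the join is on the index, the result has exactly one row per input file, which is what the first experiment needs.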


In [4]:
first_experiment_df = experiments_df.copy()  # persist it

For the second one, the testing set will be the entire website.

In [18]:
train_dataset_files = glob.glob('../data/ecommerce-new/final/split-label/*.csv')
test_dataset_files = glob.glob('../data/ecommerce-new/final/split-url/*.csv')

train_datasets = [(file, ) + get_dataset_descr_from_filename(file) for file in train_dataset_files]
test_datasets = [(file, ) + get_dataset_descr_from_filename(file) for file in test_dataset_files]

# we need to pair the label/website files with their website equivalent
train_file_df = pd.DataFrame(data=train_datasets, columns=('file', 'website', 'label'))
test_file_df = pd.DataFrame(data=test_datasets, columns=('file', 'website', 'label'))

# join them on the same website, with the proper suffixes
experiments_df = train_file_df.merge(test_file_df, on='website',  suffixes=('_train', '_test'))
experiments_df['website_train'] = experiments_df['website_test'] = experiments_df['website']
experiments_df.head()  # inspect the experiments

Unnamed: 0,file_train,website,label_train,file_test,label_test,website_train,website_test
0,../data/ecommerce-new/final/split-label/www.em...,www.emag.ro,detail_description_label,../data/ecommerce-new/final/split-url/www.emag...,all,www.emag.ro,www.emag.ro
1,../data/ecommerce-new/final/split-label/www.em...,www.emag.ro,detail_image_label,../data/ecommerce-new/final/split-url/www.emag...,all,www.emag.ro,www.emag.ro
2,../data/ecommerce-new/final/split-label/www.em...,www.emag.ro,detail_price_label,../data/ecommerce-new/final/split-url/www.emag...,all,www.emag.ro,www.emag.ro
3,../data/ecommerce-new/final/split-label/www.em...,www.emag.ro,detail_title_label,../data/ecommerce-new/final/split-url/www.emag...,all,www.emag.ro,www.emag.ro
4,../data/ecommerce-new/final/split-label/www.em...,www.emag.ro,list_image_label,../data/ecommerce-new/final/split-url/www.emag...,all,www.emag.ro,www.emag.ro
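
The join on `website` behaves differently from the index self-merge: the key column is shared, so it does not receive a suffix, which is why the cell above copies it into `website_train` and `website_test` by hand. A small sketch with hypothetical file names:

```python
import pandas as pd

# Two per-label training files and one whole-website test file
# (all names are made up for illustration).
train = pd.DataFrame({'file': ['a-title.csv', 'a-price.csv'],
                      'website': ['site-a.example', 'site-a.example'],
                      'label': ['detail_title_label', 'detail_price_label']})
test = pd.DataFrame({'file': ['a-all.csv'],
                     'website': ['site-a.example'],
                     'label': ['all']})

# Every training file is paired with the test file of the same website.
joined = train.merge(test, on='website', suffixes=('_train', '_test'))

# Only the overlapping non-key columns were suffixed; 'website' was not,
# so the suffixed copies are added explicitly.
joined['website_train'] = joined['website_test'] = joined['website']

print(joined.columns.tolist())
```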


In [19]:
second_experiment_df = experiments_df.copy()

For the third one, both training and testing use the entire website.

In [7]:
# describe the experiments
train_dataset_files = glob.glob('../data/ecommerce-new/final/split-url/*.csv')
train_datasets = [(file, ) + get_dataset_descr_from_filename(file) for file in train_dataset_files]
train_file_df = pd.DataFrame(data=train_datasets, columns=('file', 'website', 'label'))

# merge it with self 
experiments_df = train_file_df.merge(train_file_df, left_index=True, right_index=True, suffixes=('_train', '_test'))
experiments_df.head()  # inspect the experiments

Unnamed: 0,file_train,website_train,label_train,file_test,website_test,label_test
0,../data/ecommerce-new/final/split-url/lajumate...,lajumate.ro,all,../data/ecommerce-new/final/split-url/lajumate...,lajumate.ro,all
1,../data/ecommerce-new/final/split-url/www.alie...,www.aliexpress.com,all,../data/ecommerce-new/final/split-url/www.alie...,www.aliexpress.com,all
2,../data/ecommerce-new/final/split-url/www.amaz...,www.amazon.com,all,../data/ecommerce-new/final/split-url/www.amaz...,www.amazon.com,all
3,../data/ecommerce-new/final/split-url/www.emag...,www.emag.ro,all,../data/ecommerce-new/final/split-url/www.emag...,www.emag.ro,all
4,../data/ecommerce-new/final/split-url/www.okaz...,www.okazii.ro,all,../data/ecommerce-new/final/split-url/www.okaz...,www.okazii.ro,all


In [8]:
third_experiment_df = experiments_df.copy()

The fourth one is trained on an entire website and tested on all of them.

In [9]:
train_dataset_files = glob.glob('../data/ecommerce-new/final/split-url/*.csv')
train_datasets = [(file, ) + get_dataset_descr_from_filename(file) for file in train_dataset_files]
train_file_df = pd.DataFrame(data=train_datasets, columns=('file_train', 'website_train', 'label_train'))

experiments_df = train_file_df
experiments_df['file_test'] = '../data/ecommerce-new/final/split-url/*.csv'
experiments_df['website_test'] = experiments_df['label_test'] = 'all'
               
experiments_df.head()

Unnamed: 0,file_train,website_train,label_train,file_test,website_test,label_test
0,../data/ecommerce-new/final/split-url/lajumate...,lajumate.ro,all,../data/ecommerce-new/final/split-url/*.csv,all,all
1,../data/ecommerce-new/final/split-url/www.alie...,www.aliexpress.com,all,../data/ecommerce-new/final/split-url/*.csv,all,all
2,../data/ecommerce-new/final/split-url/www.amaz...,www.amazon.com,all,../data/ecommerce-new/final/split-url/*.csv,all,all
3,../data/ecommerce-new/final/split-url/www.emag...,www.emag.ro,all,../data/ecommerce-new/final/split-url/*.csv,all,all
4,../data/ecommerce-new/final/split-url/www.okaz...,www.okazii.ro,all,../data/ecommerce-new/final/split-url/*.csv,all,all
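
Here `file_test` holds a glob pattern rather than a single path. How `simple_model_experiment` resolves it is not shown in this notebook; `dask.dataframe.read_csv` accepts such patterns directly, and a pandas-only equivalent could look like the sketch below. The `load_csv_pattern` helper is hypothetical and is not part of the `experiments` module.

```python
import glob
import os
import tempfile

import pandas as pd


def load_csv_pattern(pattern):
    """Hypothetical loader: read every CSV matching `pattern` into one frame."""
    paths = sorted(glob.glob(pattern))
    return pd.concat([pd.read_csv(path) for path in paths], ignore_index=True)


# Demonstrate on two throwaway files instead of the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({'x': [1, 2]}).to_csv(os.path.join(tmp, 'site-a.csv'), index=False)
    pd.DataFrame({'x': [3]}).to_csv(os.path.join(tmp, 'site-b.csv'), index=False)
    combined = load_csv_pattern(os.path.join(tmp, '*.csv'))

print(len(combined))
```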


In [10]:
fourth_experiment_df = experiments_df.copy()

Finally, the last one is trained on all the websites and tested on them as well.

In [11]:
fifth_experiment_df = pd.DataFrame(data={'file_train': '../data/ecommerce-new/final/split-url/*.csv', 
                                         'file_test':'../data/ecommerce-new/final/split-url/*.csv',
                                         'website_train': 'all', 'website_test': 'all', 
                                         'label_train': 'all', 'label_test': 'all'}, index=[0])
fifth_experiment_df

Unnamed: 0,file_test,file_train,label_test,label_train,website_test,website_train
0,../data/ecommerce-new/final/split-url/*.csv,../data/ecommerce-new/final/split-url/*.csv,all,all,all,all


## Experiment running
Now that we have the datasets to run the experiments on, we can finally run each experiment on its datasets. We will save the results in a couple of dataframes, which we will persist to a CSV.
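
The `rf_eval` model function comes from the local `experiments` module and its body is not shown here. As a rough idea of what a model function of this kind could do, the sketch below fits a random forest and scores it with `precision_recall_fscore_support`. The name `rf_eval_sketch`, its signature, and the synthetic data are all assumptions made for illustration, not the actual implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support


def rf_eval_sketch(X_train, y_train, X_test, y_test, seed=0):
    """Hypothetical stand-in for rf_eval: fit a forest, return binary scores."""
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # average='binary' scores the positive class only; support is None then.
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average='binary', zero_division=0)
    return {'precision': precision, 'recall': recall, 'f1': f1}


# Synthetic data: the label depends only on the sign of the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

scores = rf_eval_sketch(X[:150], y[:150], X[150:], y[150:])
print(sorted(scores))
```

In the notebook one such evaluation would presumably run per label column, since each entry of `label_cols` is a separate binary target.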

### Train/test on website subset

In [None]:
first_results_df = simple_model_experiment(map(lambda x: x[1], first_experiment_df.iterrows()), model_func=rf_eval, 
                                           experiment_name='first-random-forest', label_cols=label_cols)
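
The `map(lambda x: x[1], ...)` idiom used in these calls strips the index from what `iterrows()` yields, so the experiment function receives one row `Series` per experiment. A minimal sketch on a toy frame with made-up file names:

```python
import pandas as pd

experiments = pd.DataFrame({'file_train': ['a.csv', 'b.csv'],
                            'file_test': ['a.csv', 'b.csv']})

# iterrows() yields (index, row) tuples; x[1] keeps only the row Series.
rows = list(map(lambda x: x[1], experiments.iterrows()))

# The same thing written as a comprehension.
rows_alt = [row for _, row in experiments.iterrows()]

print(rows[0]['file_train'])
```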

### Train on website subset. Test on whole website

In [None]:
second_results_df = simple_model_experiment(map(lambda x: x[1], second_experiment_df.iterrows()), model_func=rf_eval, 
                                            experiment_name='second-random-forest', label_cols=label_cols)

### Train/test on whole website

In [None]:
third_results_df = simple_model_experiment(map(lambda x: x[1], third_experiment_df.iterrows()), model_func=rf_eval, 
                                           experiment_name='third-random-forest', label_cols=label_cols)

### Train on single website. Test on all

In [None]:
fourth_results_df = simple_model_experiment(map(lambda x: x[1], fourth_experiment_df.iterrows()), model_func=rf_eval, 
                                            experiment_name='fourth-random-forest', label_cols=label_cols)

In [46]:
fourth_results_df['experiment'] = 'fourth-random-forest'

### Saving the results
This notebook is a little too crowded for any proper analysis. Memory is also running low, as the experiments are fairly expensive. To mitigate this, we will save the experiment results and analyze them in a different notebook.

In [48]:
experiments = [first_results_df, second_results_df, third_results_df, fourth_results_df]
pd.concat(experiments, ignore_index=True).to_csv('../data/experimental-results/first-experiments.csv')
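
Since `to_csv` writes the index as an unnamed first column, the analysis notebook will want to read the file back with `index_col=0`. The round trip below is a sketch using an in-memory buffer and placeholder values, not real experiment results.

```python
import io

import pandas as pd

# Placeholder result frames; the scores are made up.
first = pd.DataFrame({'experiment': ['first'], 'score': [0.5]})
second = pd.DataFrame({'experiment': ['second'], 'score': [0.75]})

# ignore_index=True renumbers the rows 0..n-1 across all frames.
combined = pd.concat([first, second], ignore_index=True)

buffer = io.StringIO()
combined.to_csv(buffer)

# Without index_col the saved index comes back as an 'Unnamed: 0' column.
buffer.seek(0)
naive = pd.read_csv(buffer)

# With index_col=0 the original layout is restored.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(restored.columns.tolist())
```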