# Crawl Data Analysis: Preprocessing

This notebook preprocesses our web crawl data and makes it ready for analysis. It was originally written for Python 2.7. Run it twice, once for each database (odin and webtap), adjusting the input and output file names each time.



@NOTE: Rewritten in Python 3 (Petr Hanzl)

In [2]:
from __future__ import print_function
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
import os

## Read from database

Read the crawl data from the database. This cell loads the `site_visits` table; the `segments` table is joined to it later, in the streaming query.

In [3]:
import sqlite3
import pandas as pd

#db = '/mnt/ssd/amathur/dark-patterns-databases/odin-product-pages.sqlite'
db = '/home/xhanpet/20210615-022920_segmentation_pilot/20210615-022920_segmentation_pilot.sqlite'
con = sqlite3.connect(db)
site_visits = pd.read_sql_query('''SELECT * from site_visits''', con)

In [4]:
print('Number of site visits: %s' % str(site_visits.shape))
print('site_visits columns: %s' % str(list(site_visits.columns.values)))

Number of site visits: (957, 3)
site_visits columns: ['visit_id', 'crawl_id', 'site_url']


## Pull the segment data out using stream processing

In [6]:
from urllib.parse import urlparse
from collections import defaultdict
import binascii
import json
from tqdm import tqdm
import re

con = sqlite3.connect(db)
con.row_factory = sqlite3.Row
cur = con.cursor()

query = """SELECT sv.site_url, sv.visit_id,
    sg.id, sg.node_name, sg.node_id, sg.top, sg.left, sg.width, sg.height, 
    sg.num_buttons, sg.num_imgs, sg.num_anchors,
    TRIM(sg.inner_text) as inner_text, TRIM(sg.longest_text) as longest_text
    FROM segments as sg LEFT JOIN site_visits as sv
    ON sv.visit_id = sg.visit_id WHERE
    LOWER(sg.node_name) <> 'body' AND TRIM(sg.inner_text) <> ''
    """

In [1]:
#segment_json = '/mnt/ssd/amathur/dark-patterns-output/segments_odin.json'
segment_json = '/home/xhanpet/analysis-data/segments.json'
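
The join-and-filter logic of the query can be tried out on a small in-memory database before running it against a full crawl. The sketch below is only an illustration: the table layouts are reduced to a few columns and the sample rows are hypothetical, so the real schema may differ. It shows that the `body` segment, the whitespace-only segment, and the `NULL` segment are all filtered out.

```python
import sqlite3

# In-memory stand-in for the crawl database (assumed, reduced schema).
con = sqlite3.connect(':memory:')
con.row_factory = sqlite3.Row
con.executescript("""
    CREATE TABLE site_visits (visit_id INTEGER, crawl_id INTEGER, site_url TEXT);
    CREATE TABLE segments (id INTEGER, visit_id INTEGER, node_name TEXT, inner_text TEXT);
    INSERT INTO site_visits VALUES (1, 1, 'https://shop.example.com/item/1');
    INSERT INTO segments VALUES (10, 1, 'BODY', 'whole page text');
    INSERT INTO segments VALUES (11, 1, 'DIV', '  Only 3 left  ');
    INSERT INTO segments VALUES (12, 1, 'SPAN', '   ');
    INSERT INTO segments VALUES (13, 1, 'P', NULL);
""")

# Same join and WHERE clause as the notebook query, with fewer columns.
sample_query = """SELECT sv.site_url, sg.id, sg.node_name,
    TRIM(sg.inner_text) AS inner_text
    FROM segments AS sg LEFT JOIN site_visits AS sv
    ON sv.visit_id = sg.visit_id WHERE
    LOWER(sg.node_name) <> 'body' AND TRIM(sg.inner_text) <> ''
    """

# Iterating the cursor yields rows one at a time instead of loading them all.
rows = [dict(row) for row in con.execute(sample_query)]
con.close()
print(rows)
```

Because `sqlite3.Row` supports access by column name, each row converts directly to a `dict`, which is what the export cell relies on when it serializes rows to JSON.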

In [8]:
# Start from a clean output file; a missing file is fine on the first run.
try:
    os.remove(segment_json)
    print('Removed %s ' % segment_json)
except FileNotFoundError:
    pass

# (hostname, checksum) pairs already written, used to skip duplicates per host.
seen_host_checksums = set()

with open(segment_json, "a") as f:
    for row in tqdm(cur.execute(query)):
        # Flatten line breaks and mask digit runs so that texts differing
        # only in numbers produce the same checksum.
        inner_processed = row['inner_text'].replace('\n', ' ').replace('\r', '')
        inner_processed = re.sub(r'\d+', 'DPNUM', inner_processed)
        hostname = urlparse(row['site_url']).hostname
        inner_processed_crc = binascii.crc32(inner_processed.encode('utf-8'))
        if (hostname, inner_processed_crc) in seen_host_checksums:
            continue
        seen_host_checksums.add((hostname, inner_processed_crc))
        # Write one JSON object per line.
        row_d = dict(row)
        row_d['inner_text_processed'] = inner_processed
        row_d['hostname'] = hostname
        json.dump(row_d, f)
        f.write('\n')

522it [00:00, 5218.86it/s]

Removed /home/xhanpet/segments-output/segments.json 


168143it [00:05, 32684.98it/s]
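
The drop from the number of streamed rows to the number of written lines comes from the per-host deduplication. As a rough illustration of that step in isolation, the sketch below wraps the same normalization and checksum in two helper functions. The names `normalize_text` and `dedup_key` and the sample URLs are hypothetical and do not appear in the notebook.

```python
import binascii
import re
from urllib.parse import urlparse


def normalize_text(text):
    """Flatten line breaks and replace every digit run with DPNUM."""
    flattened = text.replace('\n', ' ').replace('\r', '')
    return re.sub(r'\d+', 'DPNUM', flattened)


def dedup_key(site_url, text):
    """Return the (hostname, CRC32) pair used to detect repeated segments."""
    hostname = urlparse(site_url).hostname
    checksum = binascii.crc32(normalize_text(text).encode('utf-8'))
    return hostname, checksum


# Hypothetical sample rows: (site_url, inner_text)
samples = [
    ('https://shop.example.com/item/1', 'Only 3 left\nin stock'),
    ('https://shop.example.com/item/2', 'Only 12 left\nin stock'),
    ('https://other.example.org/item/1', 'Only 3 left\nin stock'),
]

seen = set()
kept = []
for site_url, text in samples:
    key = dedup_key(site_url, text)
    if key in seen:
        continue  # same host, same text up to numbers
    seen.add(key)
    kept.append((key[0], normalize_text(text)))

print(kept)
```

The second sample is dropped because it matches the first once numbers are masked, while the third is kept because it comes from a different host. CRC32 is a short checksum, so distinct texts can in principle collide; a stronger hash would be one option if that ever matters.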


In [9]:
! wc -l /home/xhanpet/analysis-data/segments.json

23151 /home/xhanpet/segments-output/segments.json
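
The output file holds one JSON object per line, so it can be read back lazily without loading everything at once. The following is a minimal sketch using only the standard library; the helper `read_json_lines` and the sample records are hypothetical, and the temporary file stands in for the real segments file.

```python
import json
import os
import tempfile


def read_json_lines(path):
    """Yield one dict per non-empty line of a JSON-lines file."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Hypothetical records standing in for the real segments file.
records = [
    {'hostname': 'shop.example.com', 'inner_text_processed': 'Only DPNUM left'},
    {'hostname': 'other.example.org', 'inner_text_processed': 'Sale ends in DPNUM hours'},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'segments_sample.json')
    # Write the records the same way the export cell does.
    with open(sample_path, 'w') as f:
        for record in records:
            json.dump(record, f)
            f.write('\n')
    loaded = list(read_json_lines(sample_path))

print(len(loaded), loaded[0]['hostname'])
```

With pandas available, `pd.read_json(path, lines=True)` should load the same format into a DataFrame in one call.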
