# Dragnet Conversion 
In this notebook we fetch the Dragnet data and convert it to our CSV format. We also output the labels alongside it in the `url/path` format used before. Since this is a binary classification problem, we only need a single class label.

## Fetching the data
This step is pretty straightforward: the seomoz team hosts the data on GitHub, so we basically just have to clone the repository and untar the archives.

In [1]:
%%bash 
cd ../data/external/

# clone and untar
git clone https://github.com/seomoz/dragnet_data.git
cd dragnet_data
tar xvf dragnet_HTML.tar.gz
tar xvf dragnet_Corrected.tar.gz

# copy the cetr data as well
wget https://www3.nd.edu/~tweninge/cetr/cetr-dataset.zip
unzip cetr-dataset.zip -d cetr-dataset
chmod +x cetr_to_dragnet.sh
./cetr_to_dragnet.sh cetr-dataset > /dev/null

# move it to the other directory and remove the junk
cd ..
mkdir dragnet 
mkdir cleaneval
mv -t dragnet dragnet_data/{HTML,Corrected}
mv -t cleaneval dragnet_data/cetr-dataset/cleaneval/en/{Corrected,HTML}

rm -r dragnet_data

HTML/
HTML/._.DS_Store
HTML/.DS_Store
HTML/100.html
HTML/102.html
HTML/103.html
HTML/106.html
HTML/108.html
HTML/109.html
HTML/110.html
HTML/113.html
HTML/116.html
HTML/117.html
HTML/119.html
HTML/120.html
HTML/128.html
HTML/131.html
HTML/134.html
HTML/135.html
HTML/137.html
HTML/14.html
HTML/140.html
HTML/141.html
HTML/142.html
HTML/143.html
HTML/145.html
HTML/147.html
HTML/149.html
HTML/15.html
HTML/151.html
HTML/152.html
HTML/153.html
HTML/158.html
HTML/160.html
HTML/161.html
HTML/167.html
HTML/168.html
HTML/170.html
HTML/172.html
HTML/175.html
HTML/176.html
HTML/186.html
HTML/188.html
HTML/19.html
HTML/190.html
HTML/191.html
HTML/192.html
HTML/196.html
HTML/198.html
HTML/199.html
HTML/20.html
HTML/201.html
HTML/205.html
HTML/210.html
HTML/212.html
HTML/217.html
HTML/220.html
HTML/222.html
HTML/223.html
HTML/227.html
HTML/230.html
HTML/232.html
HTML/233.html
HTML/236.html
HTML/238.html
HTML/239.html
HTML/242.html
HTML/244.html
HTML/248.html
HTML/250.html
HTML/252.html
HTML/257.html


Cloning into 'dragnet_data'...
--2017-09-23 18:34:57--  https://www3.nd.edu/~tweninge/cetr/cetr-dataset.zip
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving www3.nd.edu... 34.228.251.114, 52.87.65.42
Connecting to www3.nd.edu|34.228.251.114|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 61526466 (59M) [application/zip]
Saving to: ‘cetr-dataset.zip’

     0K .......... .......... .......... .......... ..........  0%  376K 2m40s
    50K .......... .......... .......... .......... ..........  0%  289K 3m4s
   100K .......... .......... .......... .......... ..........  0%  376K 2m55s
   150K .......... .......... .......... .......... ..........  0%  288K 3m3s
   200K .......... .......... .......... .......... ..........  0% 5.17M 2m29s
   250K .......... .......... .......... .......... ..........  0%  382K 2m30s
   300K .......... .......... .......... .......... ..........  0%  297K 2m37s
   350K .......... .......... .......... .........

After further investigation, there seem to be some inconsistencies in the encoding of the files (a few of them aren't UTF-8). We are going to solve that with a small bash snippet that converts the inconsistent files and removes the ones that are impossible to convert.

In [2]:
%%bash
cd ../data/external/
# convert both txt and html files
for f in $(find . -name "*.txt" -o -name "*.html"); do  
        encoding=$(file -i "$f" | cut -d"=" -f 2)  # get the mime encoding
        if [ "$encoding" != "us-ascii" ] && [ "$encoding" != "utf-8" ]; then
                res=$(chardetect "$f")  # try to detect it otherwise
                encoding=$(echo "$res" | cut -d" " -f 2)
                echo "$res - CONVERTING TO UTF-8"
                recode "${encoding}..utf-8" "$f"  # convert the file in place
        fi
done

# remove the unsolvable ones
cd dragnet
rm HTML/{R121,T19,T2,T31}.html Corrected/{R121,T19,T2,T31}.html.corrected.txt
cd ../cleaneval
rm HTML/{114,276,305,376,767,331,619,716}.html Corrected/{114,276,305,376,767,331,619,716}.html.corrected.txt

./cleaneval/Corrected/15.html.corrected.txt: Windows-1252 with confidence 0.7246884564300976 - CONVERTING TO UTF-8
./cleaneval/Corrected/250.html.corrected.txt: Windows-1252 with confidence 0.73 - CONVERTING TO UTF-8
./cleaneval/Corrected/212.html.corrected.txt: ISO-8859-1 with confidence 0.73 - CONVERTING TO UTF-8
./cleaneval/Corrected/215.html.corrected.txt: ISO-8859-1 with confidence 0.73 - CONVERTING TO UTF-8
./cleaneval/Corrected/216.html.corrected.txt: Windows-1252 with confidence 0.73 - CONVERTING TO UTF-8
./cleaneval/Corrected/217.html.corrected.txt: ISO-8859-1 with confidence 0.73 - CONVERTING TO UTF-8
./cleaneval/Corrected/219.html.corrected.txt: ISO-8859-1 with confidence 0.73 - CONVERTING TO UTF-8
./cleaneval/Corrected/221.html.corrected.txt: ISO-8859-1 with confidence 0.73 - CONVERTING TO UTF-8
./cleaneval/Corrected/222.html.corrected.txt: Windows-1252 with confidence 0.73 - CONVERTING TO UTF-8
./cleaneval/Corrected/223.html.corrected.txt: Windows-1252 with confidence 0.73

recode: ./cleaneval/Corrected/716.html.corrected.txt failed: Invalid input in step `CP949..UTF-8'
recode: ./cleaneval/HTML/114.html failed: Untranslatable input in step `CP1254..ISO-10646-UCS-2'
recode: ./cleaneval/HTML/276.html failed: Invalid input in step `CP949..UTF-8'
recode: ./cleaneval/HTML/305.html failed: Invalid input in step `CP949..UTF-8'
recode: ./cleaneval/HTML/376.html failed: Untranslatable input in step `CP1254..ISO-10646-UCS-2'
recode: ./cleaneval/HTML/767.html failed: Untranslatable input in step `CP1252..ISO-10646-UCS-2'
recode: ./dragnet/HTML/R121.html failed: Untranslatable input in step `CP1254..ISO-10646-UCS-2'
recode: ./dragnet/HTML/T19.html failed: Untranslatable input in step `CP1252..ISO-10646-UCS-2'
recode: ./dragnet/HTML/T2.html failed: Untranslatable input in step `CP1254..ISO-10646-UCS-2'
recode: ./dragnet/HTML/T31.html failed: Untranslatable input in step `CP1254..ISO-10646-UCS-2'
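
As a cross-check of the conversion above, the following is a minimal Python sketch that lists the files which still do not decode as UTF-8. The helper name `find_non_utf8_files` is hypothetical and not part of the original notebook; it only assumes that the data lives in `.txt` and `.html` files under a single root directory.

```python
import os


def find_non_utf8_files(root, extensions=('.txt', '.html')):
    """Return the sorted paths under `root` whose bytes are not valid UTF-8."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(extensions):
                continue  # only look at the text and html files
            path = os.path.join(dirpath, name)
            with open(path, 'rb') as fin:
                raw = fin.read()  # read raw bytes, no decoding yet
            try:
                raw.decode('utf-8')
            except UnicodeDecodeError:
                bad.append(path)
    return sorted(bad)
```

Calling `find_non_utf8_files('../data/external/')` after the cell above should return an empty list if every remaining file was converted or removed.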


## Converting to CSV
Now that we have the HTML files, we can construct the CSV by setting the `url` column to `file://filename.html` and the `html` column to the actual content of the file.

In [1]:
%matplotlib inline

# standard library
import os
import re
from difflib import SequenceMatcher

# lxml
from lxml import etree

# pandas
import pandas as pd
import dask.dataframe as dd

# numpy, matplotlib, seaborn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# this styling is purely my preference
# less chartjunk
sns.set(style='ticks', palette='Set2')
# set the context last, since sns.set() resets it to the defaults
sns.set_context('notebook', font_scale=1.5, rc={'lines.linewidth': 2.5})

In [2]:
# get the html files 
html_files = [file for file in os.listdir("../data/external/dragnet/HTML") if file.endswith(".html")]
html_files[:5]  # inspect the first few

['R256.html', 'R581.html', '100.html', '102.html', '103.html']

In [3]:
# build the dataframe 
ddf = dd.from_pandas(data=pd.DataFrame({'file': html_files}), chunksize=10)
ddf.head()

Unnamed: 0,file
0,R256.html
1,R581.html
2,100.html
3,102.html
4,103.html


In [4]:
def read_dir_file(file, directory):
    """Read a file from a given directory"""
    # the files were converted to UTF-8 above, so decode them explicitly
    with open(os.path.join(directory, file), encoding='utf-8') as fin:
        return fin.read()  # return the entire content

# test
read_dir_file('9.html', directory='../data/external/dragnet/HTML')[:20] 

'<!DOCTYPE html><html'

In [5]:
ddf['html'] = ddf['file'].apply(read_dir_file, meta=('html', str), directory='../data/external/dragnet/HTML')
ddf['url'] = 'file://' + ddf['file']  # convert it to the proper format
ddf.head()  # sanity-check

Unnamed: 0,file,html,url
0,R256.html,"\n<!DOCTYPE html>\n<html class=""site-creativit...",file://R256.html
1,R581.html,"<!DOCTYPE html>\n<html dir=""ltr"" lang=""en-US"">...",file://R581.html
2,100.html,"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 T...",file://100.html
3,102.html,"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 S...",file://102.html
4,103.html,"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 S...",file://103.html


The results look good! We should be ready to persist them to a file.

In [8]:
%%bash 
mkdir -p ../data/raw/dragnet  # make the folder if it doesn't exist
mkdir -p ../data/raw/cleaneval

In [6]:
# define a reusable function
def convert_dragnet_dataset(directory, prefix=''):
    """Returns a csv dataset from a dragnet one (or cleaneval).
    The urls are encoded as file://{prefix}{filename}"""
    html_dir = os.path.join(directory, "HTML")
    html_files = [file for file in os.listdir(html_dir) if file.endswith(".html")]
    
    ddf = dd.from_pandas(data=pd.DataFrame({'file': html_files}), chunksize=10)
    ddf['html'] = ddf['file'].apply(read_dir_file, meta=('html', str), directory=html_dir)
    ddf['url'] = 'file://' + prefix + ddf['file']  # convert it to the proper format
    return ddf

dragnet_ddf = convert_dragnet_dataset('../data/external/dragnet/', 'dragnet-')
cleaneval_ddf = convert_dragnet_dataset('../data/external/cleaneval/', 'cleaneval-')

Voilà! Both datasets, **Dragnet** and **CleanEval**, are now loaded in our own `url`/`html` layout, which is what will end up in the CSV files. Now we can move on to extracting the labels.

## Labeling
Both the dragnet and the cleaneval datasets present the gold standard as txt files made of blocks that together form the extracted text. From these blocks we can identify which html tags they correspond to by fuzzy matching them against each tag's text content. Some of the files also contain a `!@#$%^&*()  COMMENTS` line, which separates the comment blocks from the main content, so we could potentially distinguish 3 classes. The **dragnet** ones look like this:


```
23 July 2012
8 great talks on war and peace

James Stavridis speaks at TED Global 2012

“Walls don’t work,” James Stavridis declared at TEDGlobal 2012. A highly accomplished Navy Admiral, Stavridis recalls 20th-century phenomena like trench warfare and the Berlin Wall. “Instead of building walls for security, we need to build bridges.”

In his brass-tacks talk, Stavridis lays down a vision of “open-source security,” which he defines as “connecting the international, the inter-agency, the private/public, and lashing it together with strategic communication.” The basic point: that to combat 21st-century threats like cybercrime, terrorism, trafficking and piracy, the military cannot work alone. Stavridis invokes the example of Wikipedia to make his point.

“Wikipedia is not created by 12 brilliant people locked in a room writing articles. Wikipedia every day is tens of thousands of people inputting information,” Stavridis says. “It’s the perfect image for the fundamental point that no one of us is as smart as all of us thinking together.”

Stavridis’ thoughts on security are surprising. After the jump, listen to 8 other TEDTalks that challenge you to think of war, peace and military life in new ways.

```

As we can see, the blocks are separated by one or more newlines. As for **cleaneval**, the gold standard is presented in this format:

```
   
<h>                           Clinical Applications

 <p>   The following applications are supported by the various divisions of
    Information Technology @ Johns Hopkins.  Please note that access to
     some of these applications is restricted to computers on the Johns
              Hopkins network and are not accessible off site.

  <h>                               Allscripts
<p>   Allscripts is a medication management solution enabling physicians to
   create electronic prescriptions.  The system provides physicians with
   important information at the point of prescribing, including formulary
       management, drug utilization reviews and other key indicators.
   Allscripts provides an automated physician charge entry tool which is
               used to capture charges at the point of care.

 <h>                         Applicant Tracking System
<p>   Tracks demographics and status of job applicants BDM Pharmacy  JHH The
      RxTFC system support the  Inpatient and Outpatient Clinic areas
     supported by the five JHH Pharmacy satellites. The Investigational
   Drug Department uses the system to track certain studies. The software
    manages 10,000 daily doses and billing between $100-200,000/day. The
   Pharmacy staff of 180 has been trained in the application according to
                                usage needs.
```

Slightly harder to parse, but all blocks start with an html tag and end with a blank line, or another tag. 

To make both formats easier to compare, we will collapse whitespace, since fuzzy matching doesn't really care about it. We'll also replace all tabs with spaces for convenience.

In [7]:
def collapse_whitespace(strarg, remove_nl=False):
    """Returns a cleaned-up version of the block text.
    It collapses whitespace, replaces tabs with spaces and, if specified,
    only keeps the tag delimiters (like in cleaneval) as newlines."""
    strarg = re.sub(r'\t+', ' ', strarg)  # replace tabs with spaces
    if remove_nl:
        # remove newlines, used for cleaneval
        # the tags removed below become the only block delimiters
        strarg = re.sub(r'\n', ' ', strarg)  
    strarg = re.sub(r'<[a-zA-Z]+>', '\n', strarg)
    strarg = re.sub(r' +',  ' ', strarg)  # collapse whitespace
    return strarg

def get_blocks(strarg, cleaneval=False):
    """Gets the sanitized blocks to use for fuzzy matching
    First sanitizes the entire text, then splits it, trims excessive
    whitespace and removes any null ones."""
    sanitized_str = collapse_whitespace(strarg, remove_nl=cleaneval)  # sanitize the string
    blocks = sanitized_str.split('\n')  # split each block of text
    stripped_blocks = (block.strip() for block in blocks)  # trim leading and trailing whitespace
    return [block for block in stripped_blocks if block]  # filter out empty blocks

In [8]:
cleaneval_str = """
<h>                           Clinical Applications

 <p>   The following applications are supported by the various divisions of
    Information Technology @ Johns Hopkins.  Please note that access to
     some of these applications is restricted to computers on the Johns
              Hopkins network and are not accessible off site.

  <h>                               Allscripts
<p>   Allscripts is a medication management solution enabling physicians to
   create electronic prescriptions.  The system provides physicians with
   important information at the point of prescribing, including formulary
       management, drug utilization reviews and other key indicators.
   Allscripts provides an automated physician charge entry tool which is
               used to capture charges at the point of care.
"""

collapse_whitespace(cleaneval_str, True)

' \n Clinical Applications \n The following applications are supported by the various divisions of Information Technology @ Johns Hopkins. Please note that access to some of these applications is restricted to computers on the Johns Hopkins network and are not accessible off site. \n Allscripts \n Allscripts is a medication management solution enabling physicians to create electronic prescriptions. The system provides physicians with important information at the point of prescribing, including formulary management, drug utilization reviews and other key indicators. Allscripts provides an automated physician charge entry tool which is used to capture charges at the point of care. '

In [9]:
dragnet_str = """
23 July 2012
8 great talks on war and peace

James Stavridis speaks at TED Global 2012

“Walls don’t work,” James Stavridis declared at TEDGlobal 2012. A highly accomplished Navy Admiral, Stavridis recalls 20th-century phenomena like trench warfare and the Berlin Wall. “Instead of building walls for security, we need to build bridges.”
"""

collapse_whitespace(dragnet_str)

'\n23 July 2012\n8 great talks on war and peace\n\nJames Stavridis speaks at TED Global 2012\n\n“Walls don’t work,” James Stavridis declared at TEDGlobal 2012. A highly accomplished Navy Admiral, Stavridis recalls 20th-century phenomena like trench warfare and the Berlin Wall. “Instead of building walls for security, we need to build bridges.”\n'

In [10]:
get_blocks(cleaneval_str, cleaneval=True)

['Clinical Applications',
 'The following applications are supported by the various divisions of Information Technology @ Johns Hopkins. Please note that access to some of these applications is restricted to computers on the Johns Hopkins network and are not accessible off site.',
 'Allscripts',
 'Allscripts is a medication management solution enabling physicians to create electronic prescriptions. The system provides physicians with important information at the point of prescribing, including formulary management, drug utilization reviews and other key indicators. Allscripts provides an automated physician charge entry tool which is used to capture charges at the point of care.']

In [11]:
get_blocks(dragnet_str, cleaneval=False)

['23 July 2012',
 '8 great talks on war and peace',
 'James Stavridis speaks at TED Global 2012',
 '“Walls don’t work,” James Stavridis declared at TEDGlobal 2012. A highly accomplished Navy Admiral, Stavridis recalls 20th-century phenomena like trench warfare and the Berlin Wall. “Instead of building walls for security, we need to build bridges.”']
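
The `!@#$%^&*()  COMMENTS` separator mentioned earlier is not handled by `get_blocks`, so comment blocks end up mixed with the content blocks. A minimal sketch of how the two parts could be split beforehand is shown here. The function name `split_content_and_comments` is hypothetical, and the exact shape of the separator line (the symbols, some spaces, then `COMMENTS`) is an assumption based only on the line quoted above.

```python
import re

# Assumption: the separator sits on its own line and looks like
# "!@#$%^&*()  COMMENTS", with any number of spaces or tabs in between.
COMMENTS_SEPARATOR = re.compile(r'^!@#\$%\^&\*\(\)[ \t]+COMMENTS[ \t]*$',
                                re.MULTILINE)


def split_content_and_comments(text):
    """Split a gold-standard text into (content, comments).

    If the separator is missing, everything is treated as content.
    """
    parts = COMMENTS_SEPARATOR.split(text, maxsplit=1)
    content = parts[0]
    comments = parts[1] if len(parts) > 1 else ''
    return content, comments
```

Each of the two returned strings could then be passed to `get_blocks` separately, which would give one list of blocks per class.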

The code works well. Now we have to apply it to all the pages, but first we should associate each page with the blocks that need to be extracted. Afterwards we will need to extract all the tags in the typical `url, path` format, together with the text they contain.

In [12]:
def get_blocks_for_file(filename, directory, cleaneval=False):
    """For the given filename (the html file) and the root directory of
    the dataset, return its list of blocks. Can specify whether the dataset is
    CleanEval or not."""
    corrected_dir = os.path.join(directory, "Corrected")
    filename = filename + '.corrected.txt'
    with open(os.path.join(corrected_dir, filename), encoding='utf-8') as f:
        content = f.read()  # retrieve the content
        
    return get_blocks(content, cleaneval=cleaneval)  # sanitize and return

# test
get_blocks_for_file('1.html', '../data/external/cleaneval/', cleaneval=True)

["If you feel it's time to start taking action and get your property sold FAST and for TOP DOLLAR; here's a good place to start.",
 'Your Home Will Sell',
 '* Fast',
 'For Top Dollar',
 '* With the Least Amount of Hassle You will get these results because of our:',
 '1. Unique Team System',
 '2. Innovative Consumer Programs',
 '3. Leading Edge Technology',
 '4. Specialized Knowledge',
 "72% of homeowners are dissatisfied with their agent's performance",
 'Why?',
 'The Major Reason: Poor Communication',
 "The ABC's of Real Estate Marketing (What Most Realtors Do)",
 'Advertise themselves',
 'Bang a sign into your lawn',
 'Create an ad for the paper (and maybe run it)',
 'Download your listing to the MLS',
 'Encourage their office to show it',
 'Figure they might try an open house',
 'Get on their knees and pray it will sell',
 "This is the way real estate has been practiced for the past 100 years, and it's still the way many agents operate today, but...",
 '... these traditional methods

Now that we have the function, we should be able to apply it to the entire dataset.

In [13]:
dragnet_ddf['blocks'] = dragnet_ddf['file'].apply(get_blocks_for_file, directory='../data/external/dragnet/', 
                                                  cleaneval=False, meta=('blocks', object))
dragnet_ddf['blocks'].head()

0    [30 Designs to Inspire You Before Christmas, P...
1    [Backgrounds become foreground in election nig...
2    [July 24, 2012 12:11 AM, Latest Amelia Earhart...
3    [The Dangerous Hubris of the Startup World, by...
4    [Binomial coefficient trick, by John on July 2...
Name: blocks, dtype: object

In [14]:
cleaneval_ddf['blocks'] = cleaneval_ddf['file'].apply(get_blocks_for_file, directory='../data/external/cleaneval/', 
                                                  cleaneval=True, meta=('blocks', object))
cleaneval_ddf['blocks'].head()

0    [Watering Tricks by Lisa Erickson, Watering ca...
1    [If you feel it's time to start taking action ...
2    [Some VERY interesting information on 666, by ...
3    [Re: Mobile Multimedia Messaging Service, From...
4    ["Bugs" in the 24 March 1999 Version, From: er...
Name: blocks, dtype: object

We can now try to extract the tags and their respective text content.

In [15]:
def extract_text_from_html(html):
    """Given some html, return the dataframe of
    paths and text contents.
    """
    # we will transform the str to bytes, otherwise lxml complains
    # in some edge cases when the encoding is specified in the document
    root = etree.HTML(html.encode('utf-8'))  # get the nodes
    paths, texts = zip(*((node.getroottree().getpath(node), 
                          '' if node.tag is etree.Comment or node.tag is etree.PI else ''.join(node.itertext())) 
                         for node in root.iter()))
    return pd.DataFrame(data={'path': paths, 'text': texts})

html = dragnet_ddf['html'].head()[0]
extract_text_from_html(html).iloc[100:105]

Unnamed: 0,path,text
100,/html/body/div[2]/div/div[2]/div[2]/div[2]/div...,in Inspiration
101,/html/body/div[2]/div/div[2]/div[2]/div[2]/div...,Inspiration
102,/html/body/div[2]/div/div[2]/div[2]/div[2]/p[1],The holiday season is fast approaching and aro...
103,/html/body/div[2]/div/div[2]/div[2]/div[2]/h3[1],Season’s Greetings
104,/html/body/div[2]/div/div[2]/div[2]/div[2]/p[2],


In [16]:
def extract_text_from_df(df):
    """Given a dataframe of htmls and urls, return
    a dataframe of nodes and their text content.
    """
    grouped = df.groupby(level=0)[['html', 'url']]  # group by unique default index
    
    # apply receives each group as a Series if we are applying to a series
    # or as a DataFrame in this case (with a single row)
    result = grouped.apply(lambda x: extract_text_from_html(x['html'].iat[0]).assign(url=x['url'].iat[0]))
    return result.reset_index(drop=True)  # drop the multiindex

extract_text_from_df(dragnet_ddf.head(2)).head()

Unnamed: 0,path,text,url
0,/html,\n\n\t\n\n\t30 Designs to Inspire You Before C...,file://dragnet-R256.html
1,/html/head,\n\t\n\n\t30 Designs to Inspire You Before Chr...,file://dragnet-R256.html
2,/html/head/meta[1],,file://dragnet-R256.html
3,/html/head/title,30 Designs to Inspire You Before Christmas | ...,file://dragnet-R256.html
4,/html/head/link[1],,file://dragnet-R256.html


In [17]:
def extract_text_from_ddf(ddf):
    """The same as the df version, but works with
    dask dataframes instead."""
    # we basically abuse map_partitions' ability to expand indexes for lack of a working
    # groupby(level) in dask
    return ddf.map_partitions(extract_text_from_df, meta={'path': str, 'text': str, 'url': str}).clear_divisions()

extract_text_from_ddf(dragnet_ddf).head()

Unnamed: 0,path,text,url
0,/html,\n\n\t\n\n\t30 Designs to Inspire You Before C...,file://dragnet-R256.html
1,/html/head,\n\t\n\n\t30 Designs to Inspire You Before Chr...,file://dragnet-R256.html
2,/html/head/meta[1],,file://dragnet-R256.html
3,/html/head/title,30 Designs to Inspire You Before Christmas | ...,file://dragnet-R256.html
4,/html/head/link[1],,file://dragnet-R256.html


Now that we have the text content, we can correlate it with the blocks associated with each one of the urls.

In [18]:
def block_max_ratio(ser, reference):
    """Receives a row with a `url` and a `text`, and a reference series
    of block lists indexed by url. It returns the maximum fuzzy
    match ratio between the text and the blocks of that url."""
    blocks = reference[ser['url']]  # get the blocks
    ratios = (SequenceMatcher(None, ser['text'], block).ratio() for block in blocks)  # do fuzzy matching for all
    return max(ratios, default=0.)  # return the maximum

# test
block_max_ratio(pd.Series(['a', 'as da '], index=['url', 'text']), pd.Series([['asadssad', 'ad'], ['werew']], index=['a', 'b']))

0.5714285714285714
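
To make the ratio easier to interpret: `SequenceMatcher.ratio()` is documented as `2 * M / T`, where `M` is the number of matching characters and `T` is the combined length of both strings. The sketch below works through a few cases. The wrapper `ratio` and the sample strings are only illustrative, and `0.9` is the default threshold used later in this notebook.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity in [0, 1], computed as 2 * matches / (len(a) + len(b))."""
    return SequenceMatcher(None, a, b).ratio()


# "bcd" is shared by both strings: 2 * 3 / (4 + 4) = 0.75
print(ratio('abcd', 'bcde'))

block = 'Season’s Greetings'  # sample block text, for illustration only
print(ratio(block, block))  # identical strings give 1.0

# a node whose text contains the block plus some extra text scores lower,
# here 2 * 18 / (28 + 18), which is below a threshold of 0.9
print(ratio(block + ' Read more', block))
```

This is why a high threshold tends to select the tag whose text is closest to the block itself, rather than the larger containers wrapping it.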

In [21]:
def block_max_ratio(ser, reference=None):
    """Receives a row holding a `text` and its list of `blocks`.
    It returns the maximum fuzzy match ratio between the text
    and the blocks. `reference` is unused in this version."""
    blocks = ser['blocks'] # get the blocks
    ratios = (SequenceMatcher(None, ser['text'], block).ratio() for block in blocks)  # do fuzzy matching for all
    return max(ratios, default=0.)  # return the maximum


def convert_dataset(directory, prefix, cleaneval=False, block_thresh=0.9, label_name='content'):
    """Given a directory for a dragnet-style dataset, return
    the `url,html` and the label dataframe. Can specify the 
    fuzzy threshold above which a text is considered to correspond to a block
    and also whether the dataset is CleanEval or not."""
    html_ddf = convert_dragnet_dataset(directory, prefix)  # get the html content
    html_ddf['blocks'] = html_ddf['file'].apply(get_blocks_for_file, directory=directory, 
                                                cleaneval=cleaneval, meta=('blocks', object))
    text_tag_ddf = extract_text_from_ddf(html_ddf)
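    # every partition restarts its index at 0, so rebuild a unique index;
    # otherwise the label assignment below would align on duplicate labels
    text_tag_df = text_tag_ddf.compute().reset_index(drop=True)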
    
    # persist it for the lookup speedup
    html_df = html_ddf.compute()

    # get the ratios (only the blocks are needed from html_df)
    ratios = text_tag_df.merge(html_df[['url', 'blocks']], on='url')[['text', 'blocks']] \
             .apply(block_max_ratio, axis=1)
    # assign the label to the tags whose greatest matching ratio is at
    # least the value of the threshold
    text_tag_df[label_name + '_label'] = ratios >= block_thresh  
    
    # return the content and the labels
    return html_df[['url', 'html']], text_tag_df[['url', 'path', label_name + '_label']]

In [None]:
# convert the cleaneval dataset
cleaneval_df, cleaneval_label_df = convert_dataset('../data/external/cleaneval/', 'cleaneval-', cleaneval=True)

cleaneval_df.to_csv('../data/raw/cleaneval/raw.csv', index=False)
cleaneval_label_df.to_csv('../data/raw/cleaneval/labels.csv', index=False)

In [24]:
dragnet_df, dragnet_label_df = convert_dataset('../data/external/dragnet/', 'dragnet-', cleaneval=False)

dragnet_df.to_csv('../data/raw/dragnet/raw.csv', index=False)
dragnet_label_df.to_csv('../data/raw/dragnet/labels.csv', index=False)

## Conclusion
After all of this has run, the results should be stored in the specified locations. The big caveat with this approach is that it takes an absurdly long time. **TODO** The implementation should somehow leverage dask's parallelized `apply`, as it currently takes 24h+ for the dragnet dataset.
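
Independently of parallelism, part of the cost comes from calling `ratio()` on every text and block pair. `difflib` also provides `quick_ratio()` and `real_quick_ratio()`, which are documented as cheap upper bounds on `ratio()`, so pairs that cannot reach the threshold can be skipped early. The following is a minimal sketch under that assumption. The function name `matches_any_block` is hypothetical, it was not benchmarked on these datasets, and it returns the final boolean label instead of the maximum ratio.

```python
from difflib import SequenceMatcher


def matches_any_block(text, blocks, threshold=0.9):
    """Return True if `text` matches at least one block with ratio >= threshold."""
    matcher = SequenceMatcher(None)
    matcher.set_seq1(text)  # same argument order as block_max_ratio
    for block in blocks:
        matcher.set_seq2(block)
        # both calls are upper bounds on ratio(), so failing one is final
        if matcher.real_quick_ratio() < threshold:
            continue
        if matcher.quick_ratio() < threshold:
            continue
        if matcher.ratio() >= threshold:
            return True  # stop at the first block that is good enough
    return False
```

Since the upper bounds never fall below the real ratio, the result should be the same as comparing the output of `block_max_ratio` against `block_thresh`, only with fewer expensive comparisons.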