# Innoplexus Challenge

The task of this challenge is to classify web pages based on their text content.

There are 9 classes I must bin pages into:

1. People profile
2. Conferences/Congress
3. Forums
4. News article
5. Clinical trials
6. Publication
7. Thesis
8. Guidelines
9. Others

### The Data
I have a train dataframe:

|Variable|Definition|
|-|-|
|Webpage_id|Unique ID for the Web page|
|Domain|Domain|
|Url|Complete Url|
|Tag|(Target) Tag (Class) of the Web page|

I have an html_data dataframe:


|Variable|Definition|
|-|-|
|Webpage_id|Unique ID for the Web page|
|Html|Web page data in HTML|

I have a test dataframe:

|Variable|Definition|
|-|-|
|Webpage_id|Unique ID for the Web page|
|Domain|Domain|
|Url|Complete Url|

And I have a sample submission file:

|Variable|Definition|
|-|-|
|Webpage_id|Unique ID for the Web page|
|Tag|(Target) Tag (Class) of the Web page|

### Evaluation
At the end, I will be evaluated on F1 score.
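
For a multi-class problem like this one, F1 has to be averaged across classes somehow. The challenge description above does not say which average is used, so the sketch below assumes the support-weighted average purely for illustration. The function names and the toy labels are made up for this example.

```python
from collections import Counter

def f1_per_class(y_true, y_pred):
    """Return {label: F1} computed from parallel lists of true and predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but it was wrong
            fn[t] += 1  # missed a true t
    scores = {}
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores

def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighting each class by its count in y_true."""
    scores = f1_per_class(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * n for c, n in support.items()) / len(y_true)

# Toy labels, not taken from the real data
y_true = ["news", "news", "forum", "thesis"]
y_pred = ["news", "forum", "forum", "thesis"]
print(weighted_f1(y_true, y_pred))
```

If the competition turns out to use a macro or micro average instead, only `weighted_f1` would need to change.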

### Approach
I will:
1. Mine the html_data set for features to populate my training data
2. Do a short TPOT run to pick a good model
3. Train
4. Predict
5. Submit!

# Data Mining

In [1]:
import pandas as pd
from tqdm import tqdm_notebook
data_dir = "../data/2018-08-10_AV_Innoplexus/"
html_data = pd.read_csv(data_dir+'html_data.csv',iterator=True, chunksize=1000)
sample_submission = pd.read_csv(data_dir+"sample_submission.csv",iterator=True, chunksize=1000)
train_df = pd.read_csv(data_dir+'train.csv')
test_df = pd.read_csv(data_dir+'test.csv')

In [2]:
print(train_df.shape)
train_df.sample(10)

(53447, 4)


Unnamed: 0,Webpage_id,Domain,Url,Tag
22954,34655,painresearchforum.org,http://painresearchforum.org/papers/papers-of-...,publication
16129,24134,www.aao.org,https://www.aao.org/diagnose-this/diagnose-thi...,news
16028,23956,ntp.niehs.nih.gov,https://ntp.niehs.nih.gov/results/pubs/longter...,publication
53054,78773,california.providence.org,https://california.providence.org/find-a-docto...,profile
23779,35904,www.samhealth.org,https://www.samhealth.org/patient-visitors/fin...,profile
39121,58506,stories.abbvie.com,https://stories.abbvie.com/stories/fear-factor...,others
19658,29543,bhiva.org,http://bhiva.org/AboutBHIVA.aspx,others
31439,46851,www.netmums.com,https://www.netmums.com/coffeehouse/advice-sup...,forum
52643,78201,www.hopkinsmedicine.org,https://www.hopkinsmedicine.org/profiles/resul...,profile
1410,2304,www.nature.com,http://www.nature.com/natureevents/science/eve...,news


In [3]:
print(test_df.shape)
test_df.sample(10)

(25787, 3)


Unnamed: 0,Webpage_id,Domain,Url
12261,36314,www.uwhealth.org,https://www.uwhealth.org/findadoctor/profile/d...
23023,70989,www.jacionline.org,http://www.jacionline.org/article/S0091-6749(1...
4162,12897,www.onvia.com,https://www.onvia.com/
18847,57077,labeling.pfizer.com,http://labeling.pfizer.com/ShowLabeling.aspx?i...
3778,11405,www.otcmarkets.com,https://www.otcmarkets.com/stock/HEMP/news/Hem...
9802,29173,apt.rcpsych.org,http://apt.rcpsych.org/
14628,44050,www.carefertility.com,https://www.carefertility.com/ivf/./viewtopic....
8203,24590,newsroom.biogen.com,http://newsroom.biogen.com/press-release/inves...
7678,23142,rsif.royalsocietypublishing.org,http://rsif.royalsocietypublishing.org/content...
8211,24621,thorax.bmj.com,http://thorax.bmj.com/content/72/9/803


It looks like "Tag" is my output variable. Good to know.

In [4]:
train_df.sample(10)

Unnamed: 0,Webpage_id,Domain,Url,Tag
13333,19954,proteomesci.biomedcentral.com,https://proteomesci.biomedcentral.com/articles...,publication
31438,46850,www.netmums.com,https://www.netmums.com/coffeehouse/advice-sup...,forum
7990,11944,www.prweb.com,http://www.prweb.com/releases/2017/02/prweb140...,news
14939,22356,www.journalagent.com,http://tjtes.org/eng/jvi.aspx?un=UTD-69077,publication
17931,26827,www.pharmavoice.com,http://www.pharmavoice.com/newsreleases/teckro...,news
26877,40468,sydney.edu.au,https://sydney.edu.au/accessibility.html,others
27937,41900,www.compusystems.com,https://www.compusystems.com/servlet/AttendeeR...,others
3946,5895,community.beatingbowelcancer.org,http://community.beatingbowelcancer.org/forum/...,forum
48629,71942,academiccommons.columbia.edu,https://academiccommons.columbia.edu/catalog/a...,thesis
44777,66836,www.news.bayer.com,http://www.news.bayer.com/baynews/baynews.nsf/...,profile


I will need to pull the HTML out of the html_data file so I can start working on the features stuck within it.

The html_data file is quite large, so I'll have to write a function that grabs only the rows I want and merges them with the train table.
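
The general idea is to stream the big file, keep only rows whose ID is wanted, and stop early once every ID has been found. Here is a minimal sketch of that idea using only the standard `csv` module. The function name `filter_rows_by_id` and the in-memory sample are hypothetical and exist only for this illustration.

```python
import csv
import io

def filter_rows_by_id(csv_file, wanted_ids, id_column="Webpage_id"):
    """Stream a CSV file object and keep only rows whose id is in wanted_ids.

    Returns (kept_rows, ids_never_found). Stops reading once all ids are found.
    """
    remaining = set(wanted_ids)
    kept = []
    for row in csv.DictReader(csv_file):
        row_id = int(row[id_column])
        if row_id in remaining:
            kept.append(row)
            remaining.discard(row_id)
            if not remaining:
                break  # every requested id has been seen
    return kept, remaining

# Tiny in-memory stand-in for html_data.csv
sample = io.StringIO(
    "Webpage_id,Html\n"
    "1,<html>a</html>\n"
    "2,<html>b</html>\n"
    "3,<html>c</html>\n"
)
rows, missing = filter_rows_by_id(sample, [1, 3, 99])
print([r["Webpage_id"] for r in rows], missing)
```

Returning the IDs that were never found makes it obvious when the lookup table is incomplete. For real pages, the default field size limit of the `csv` module may need raising with `csv.field_size_limit`.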

In [5]:
html_data.get_chunk()

Unnamed: 0,Webpage_id,Html
0,1,"<!DOCTYPE html>\n<html lang=""en"" dir=""ltr"" xml..."
1,2,"<!DOCTYPE html>\n<html lang=""en"" dir=""ltr"" xml..."
2,3,"<!DOCTYPE html>\n<html lang=""en"" dir=""ltr"" xml..."
3,4,"<!DOCTYPE html>\n<html lang=""en"" dir=""ltr"" xml..."
4,5,"<!DOCTYPE html>\n<html lang=""en"" dir=""ltr"" xml..."
5,6,"<!DOCTYPE html>\n<html lang=""en"" dir=""ltr"" xml..."
6,7,"<!DOCTYPE html>\n<html lang=""en"" dir=""ltr"" xml..."
7,8,"<!DOCTYPE html>\n<html lang=""en"" dir=""ltr"" xml..."
8,9,"<!DOCTYPE html>\n<html lang=""en"" dir=""ltr"" xml..."
9,10,"<!DOCTYPE html>\n<html lang=""en"" dir=""ltr"" xml..."


In [6]:
def challenge_pipeline(in_df,train=True):
    #Everything after the last dot in the domain, e.g. 'org'
    in_df.loc[:,'domain_type'] = in_df['Domain'].str.rsplit('.',n=1).str[-1]
    in_df = get_html(in_df)
    return in_df

def get_html(in_df):
    #Stream html_data.csv in chunks, keeping only the rows whose Webpage_id is in in_df
    reader_obj = pd.read_csv(data_dir+'html_data.csv',iterator=True, chunksize=10000)
    frames = []
    match_indices = set(in_df['Webpage_id'].values.tolist())
    print(len(match_indices),' indices left...')
    for chunk in reader_obj:
        merge_df = pd.merge(in_df,chunk,how='inner',on='Webpage_id')
        match_indices -= set(merge_df['Webpage_id'].values.tolist())
        print(len(match_indices),' indices left...')
        frames.append(merge_df)
        #Stop early once every id has been matched. The file is only read once,
        #so ids that are missing from it can no longer cause an endless loop.
        if not match_indices:
            break
    return pd.concat(frames)

In [7]:
train_df_merged = get_html(train_df)
print(train_df_merged.shape)
train_df_merged.sample(10)

53447  indices left...
46616  indices left...
40091  indices left...
33466  indices left...
26910  indices left...
20084  indices left...
13419  indices left...
6322  indices left...
0  indices left...
(53447, 5)


Unnamed: 0,Webpage_id,Domain,Url,Tag,Html
4213,36459,www.tandfonline.com,http://www.tandfonline.com/doi/abs/10.1080/000...,publication,"<!DOCTYPE html>\n<html lang=""en"" class=""pb-pag..."
4739,77110,community.hrblock.com,http://community.hrblock.com,others,"<!DOCTYPE html><html prefix=""og: http://ogp.me..."
6458,49495,investor.baxter.com,http://investor.baxter.com/phoenix.zhtml?Event...,conferences,\r\n<!doctype html>\r\n\r\n<!--[if lte IE 9]> ...
832,31304,gnomad.broadinstitute.org,http://gnomad.broadinstitute.org/terms,others,<!doctype html>\n<head>\n <title>gnomAD bro...
2555,73372,ebn.bmj.com,http://ebn.bmj.com/content/9/2/52.long,publication,"<?xml version=""1.0"" encoding=""UTF-8""?><!DOCTYP..."
45,50104,www.allergan.com,https://www.allergan.com/News/CEO-Blog/August-...,news,"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 T..."
3238,35036,blogs.biomedcentral.com,http://blogs.biomedcentral.com/bmcblog/2017/08...,news,<!doctype html>\n\n<!--[if lt IE 7]>\n<html la...
902,21357,www.sjweh.fi,http://www.sjweh.fi/show_abstract.php?abstract...,publication,"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 T..."
2539,73356,ebn.bmj.com,http://ebn.bmj.com/content/10/4/112.long,publication,"<?xml version=""1.0"" encoding=""UTF-8""?><!DOCTYP..."
3560,55352,www.vectorcontrol.bayer.com,https://www.vectorcontrol.bayer.com/Media-and-...,others,\r\n<!DOCTYPE html>\r\n<!--[if lt IE 7]> ...


Well, that works nicely. Let's start feature engineering!

There's an easy first feature in Domain: the domain type (.org, .com, etc.). Let's pull that out. Some of the suffixes at the end have more than one part, so a plain split on the last dot isn't always enough. It turns out there is a really nice little package that can parse domain names for me. It's called tldextract!
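
To see why multi-part suffixes matter, here is a small standard-library sketch. It assumes a tiny hard-coded suffix set as a stand-in for the Public Suffix List that tldextract consults. `KNOWN_SUFFIXES`, `split_domain` and the `www.bbc.co.uk` host are illustrative only and are not part of the challenge data or of tldextract.

```python
# Hypothetical, tiny stand-in for the Public Suffix List
KNOWN_SUFFIXES = {"com", "org", "edu", "gov", "uk", "co.uk", "ac.uk"}

def split_domain(host):
    """Split a hostname into (subdomain, domain, suffix) using the longest known suffix."""
    labels = host.lower().split(".")
    # Walk from the longest candidate suffix to the shortest so 'co.uk' wins over 'uk'
    for i in range(1, len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in KNOWN_SUFFIXES:
            return ".".join(labels[:i - 1]), labels[i - 1], candidate
    # No known suffix: treat the whole host as the domain
    return "", host.lower(), ""

# A naive split on the last dot loses half of a two-part suffix
print("www.bbc.co.uk".rsplit(".", 1)[-1])
print(split_domain("www.bbc.co.uk"))
print(split_domain("ki.mit.edu"))
```

The real suffix list is far longer and changes over time, which is a good reason to leave this job to a maintained package.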

In [8]:
import tldextract
train_df_merged['domain_sub'] = train_df_merged['Domain'].apply(lambda x: tldextract.extract(x).subdomain)
train_df_merged['domain_main'] = train_df_merged['Domain'].apply(lambda x: tldextract.extract(x).domain)
train_df_merged['domain_suffix'] = train_df_merged['Domain'].apply(lambda x: tldextract.extract(x).suffix)
train_df_merged.sample(10)

Unnamed: 0,Webpage_id,Domain,Url,Tag,Html,domain_sub,domain_main,domain_suffix
4764,66854,www.annualreport2016.bayer.com,http://www.annualreport2016.bayer.com/financia...,profile,<!doctype html>\n<!--[if lt IE 7 ]> <html lang...,www.annualreport2016,bayer,com
2204,13244,ipp-ean17.netkey.at,https://ipp-ean17.netkey.at/index.php?p=record...,conferences,"<!DOCTYPE html>\n<html lang=""en"">\n<head>\n ...",ipp-ean17,netkey,at
3775,5724,rollitup.org,http://rollitup.org/t/police-break-the-law-mar...,forum,"<!DOCTYPE html>\n<html id=""XenForo"" lang=""en-U...",,rollitup,org
129,60248,blogs.novonordisk.com,http://blogs.novonordisk.com/graduates/tag/ass...,others,"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 T...",blogs,novonordisk,com
2010,33281,nebraskalegislature.gov,http://nebraskalegislature.gov/agencies/view.php,others,"<!DOCTYPE html>\n<html lang=""en"">\n<head>\n ...",,nebraskalegislature,gov
1228,31921,jcb.rupress.org,http://jcb.rupress.org/about,others,"<!DOCTYPE html>\n<html lang=""en"" dir=""ltr"" \n ...",jcb,rupress,org
1549,32445,labdish.cshl.edu,http://labdish.cshl.edu/2017/09/15/what-silico...,others,"<!DOCTYPE html>\r\n<html lang=""en-US"" prefix=""...",labdish,cshl,edu
1675,22500,www.cmaj.ca,http://www.cmaj.ca/content/150/5/669.reprint,publication,"<!DOCTYPE html\n PUBLIC ""-//W3C//DTD XHTML 1....",www,cmaj,ca
3188,34929,ki.mit.edu,https://ki.mit.edu/people/cif/past/shaw,profile,"<!doctype html>\n<html>\n<head lang=""en"">\n<me...",ki,mit,edu
4235,45975,www.ivfconnections.com,http://www.ivfconnections.com/forums/entry.php...,forum,"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 T...",www,ivfconnections,com


Now how about that Html!?! HTML is usually a mess, so I'm going to lean on Beautiful Soup to help me out.

In [9]:
from bs4 import BeautifulSoup
import nltk
from nltk import wordpunct_tokenize
from nltk.stem.snowball import EnglishStemmer
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
#train_df_merged['Html_soup'] = train_df_merged['Html'].apply(lambda x: BeautifulSoup(x,'html.parser'))
test_soup = BeautifulSoup(train_df_merged.iloc[0]['Html'],'html.parser')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Jake\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Jake\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [10]:
print(test_soup.prettify())

<!DOCTYPE html>
<html dir="ltr" lang="en" prefix="content: http://purl.org/rss/1.0/modules/content/  dc: http://purl.org/dc/terms/  foaf: http://xmlns.com/foaf/0.1/  og: http://ogp.me/ns#  rdfs: http://www.w3.org/2000/01/rdf-schema#  schema: http://schema.org/  sioc: http://rdfs.org/sioc/ns#  sioct: http://rdfs.org/sioc/types#  skos: http://www.w3.org/2004/02/skos/core#  xsd: http://www.w3.org/2001/XMLSchema# " xmlns:article="http://ogp.me/ns/article#" xmlns:book="http://ogp.me/ns/book#" xmlns:product="http://ogp.me/ns/product#" xmlns:profile="http://ogp.me/ns/profile#" xmlns:video="http://ogp.me/ns/video#">
 <head>
  <meta charset="utf-8"/>
  <script type="text/javascript">
   window.NREUM||(NREUM={}),__nr_require=function(e,n,t){function r(t){if(!n[t]){var o=n[t]={exports:{}};e[t][0].call(o.exports,function(n){var o=e[t][1][n];return r(o||n)},o,o.exports)}return n[t].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<t.length;o++)r(t[o]);return r}({1:[functi




Looking at this goop, I think some easy starter text to try to extract is the title of the page and the body text (marked up with `<p>` tags).

In [12]:
#Tokenize and clean the text that get_text pulls out of the soup.
def nltkPipe(soup_text):
    #Convert to lowercase tokens
    tokens = [x.lower() for x in wordpunct_tokenize(soup_text)]
    #Keep alphabetic words only. No single letters, and no stop words
    words = [w for w in tokens if w.isalpha() and len(w) > 1 and w not in stop_words]
    #Stem each word (strip suffixes) to cut down on vocab
    stemmer = EnglishStemmer()
    stemmed_words = [stemmer.stem(w) for w in words]
    return stemmed_words

def getHTMLTitleTokens(html):
    soup = BeautifulSoup(html,'html.parser')
    soup_title = soup.title
    #Some pages have no <title> tag at all
    if soup_title is not None:
        soup_title_text = soup_title.get_text()
        text_arr = nltkPipe(soup_title_text)
        return text_arr
    else:
        return []

getHTMLTitleTokens(train_df_merged.iloc[0]['Html'])

tqdm_notebook.pandas(desc='Testing title tokens...')
train_df_merged.sample(100)['Html'].progress_apply(getHTMLTitleTokens)


2799    [obsess, compuls, disord, role, gp, australian...
264                               [elsevi, articl, locat]
4988    [safeti, studi, gene, modifi, donor, cell, inf...
5635                                               [sign]
1584    [bioluminesc, base, detect, brain, immun, cell...
5461    [differ, health, care, seek, behaviour, rural,...
90      [increas, seroreact, herv, peptid, patient, ht...
333     [mithramycin, hypoglycemia, malign, insulinoma...
5537    [extens, adipocyt, matur, seen, myxoid, liposa...
289                                   [life, daili, burn]
5689                [general, staff, directori, rle, mit]
4151    [ucsf, helen, diller, famili, comprehens, canc...
415                  [中国临床试验注册中心, 世界卫生组织国际临床试验注册平台一级注册机构]
2475    [artpriz, juror, shortlist, final, photo, gall...
5389    [sanofi, present, dr, paul, chew, morningstar,...
1627    [bailli, lumber, implement, tradesharp, platform]
3176         [koch, institut, herman, eisen, news, video]
6219    [world

In [13]:
def getBodyTokens(html):
    soup = BeautifulSoup(html,'html.parser')
    #Get the text body
    soup_para = soup.find_all('p')
    #Skip paragraphs that contain a <span> or <a> tag
    soup_para_clean = ' '.join([x.get_text() for x in soup_para if x.span is None and x.a is None])
    text_arr = nltkPipe(soup_para_clean)
    return text_arr

tqdm_notebook.pandas(desc='Testing body tokens...')
train_df_merged.sample(1)['Html'].progress_apply(getBodyTokens)


2027    [rememb, current, releas, guidelin, guidelin, ...
Name: Html, dtype: object

So now we have a way to tokenize the body text and title text of each HTML page. I want to use a bag-of-words approach to generate weighted importance measures for each word found in the body or title.
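
One common way to turn a bag of words into weighted importance scores is TF-IDF. The text above does not commit to a particular weighting, so treat this as an assumed choice. The sketch below is a minimal standard-library version; the `tfidf` function and the three toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {token: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # term frequency within the document * inverse document frequency
            tok: (cnt / total) * math.log(n_docs / doc_freq[tok])
            for tok, cnt in counts.items()
        })
    return weights

# Toy stemmed token lists, not real rows
docs = [
    ["guidelin", "guidelin", "releas"],
    ["forum", "post"],
    ["guidelin", "forum"],
]
for w in tfidf(docs):
    print(w)
```

A token that appears in every document gets a weight of zero, which is the point: words common to all classes carry no signal. Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization, so their numbers will differ from this plain version.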

Let's get each sample's tokens first.

In [14]:
tqdm_notebook.pandas(desc='Getting title tokens...')
train_df_merged['title_tokens'] = train_df_merged['Html'].progress_apply(getHTMLTitleTokens)
tqdm_notebook.pandas(desc='Getting body tokens...')
train_df_merged['body_tokens'] = train_df_merged['Html'].progress_apply(getBodyTokens)


In [16]:
train_df_merged.to_csv(data_dir+"train_with_tokens.csv",index=False)