# Innoplexus Challenge

The task of this challenge is to classify web pages based on their text content.

There are 9 classes I must bin things into:

1. People profile
2. Conferences/Congress
3. Forums
4. News article
5. Clinical trials
6. Publication
7. Thesis
8. Guidelines
9. Others

### The Data
I have a train dataframe:

|Variable|Definition|
|-|-|
|Webpage_id|Unique ID for the Web page|
|Domain|Domain|
|Url|Complete Url|
|Tag|(Target) Tag (Class) of the Web page|

I have an html_data dataframe:


|Variable|Definition|
|-|-|
|Webpage_id|Unique ID for the Web page|
|Html|Web page data in HTML|

I have a test dataframe:

|Variable|Definition|
|-|-|
|Webpage_id|Unique ID for the Web page|
|Domain|Domain|
|Url|Complete Url|

And I have sample submissions:

|Variable|Definition|
|-|-|
|Webpage_id|Unique ID for the Web page|
|Tag|(Target) Tag (Class) of the Web page|

### Evaluation
At the end, my predictions will be evaluated on F1 score.
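
To make the metric concrete, here is a minimal pure-Python sketch of a multi-class F1 score. The challenge description above does not say how the per-class scores are averaged, so the support-weighted average is an assumption, and the helper names `f1_per_class` and `weighted_f1` and the toy labels are made up for illustration.

```python
from collections import Counter

def f1_per_class(y_true, y_pred):
    """Return {label: F1} computed from true and predicted label lists."""
    labels = set(y_true) | set(y_pred)
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its true count (assumed averaging)."""
    support = Counter(y_true)
    scores = f1_per_class(y_true, y_pred)
    return sum(scores[label] * n for label, n in support.items()) / len(y_true)

# Toy labels, not taken from the challenge data
y_true = ["news", "news", "forum", "others"]
y_pred = ["news", "forum", "forum", "others"]
print(f1_per_class(y_true, y_pred))
print(weighted_f1(y_true, y_pred))
```

If the competition turns out to use a different averaging (macro or micro, for example), only `weighted_f1` would need to change.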

### Approach
I will:
1. Mine the html_data set for features to populate my training data
2. Do a short TPOT run to find a good choice of model
3. Train
4. Predict
5. Submit!

# Data Mining

In [202]:
import pandas as pd
from tqdm import tqdm_notebook
data_dir = "../data/2018-08-10_AV_Innoplexus/"
html_data = pd.read_csv(data_dir+'html_data.csv',iterator=True, chunksize=1000)
sample_submission = pd.read_csv(data_dir+"sample_submission.csv",iterator=True, chunksize=1000)
train_df = pd.read_csv(data_dir+'train.csv')
test_df = pd.read_csv(data_dir+'test.csv')

In [2]:
print(train_df.shape)
train_df.sample(10)

(53447, 4)


Unnamed: 0,Webpage_id,Domain,Url,Tag
41464,62031,www.bayer.com,https://www.bayer.com/en/Google-Search.aspx,others
2852,4067,cris.nih.go.kr,https://cris.nih.go.kr/cris/en/search/search_r...,clinicalTrials
25626,38592,www.rle.mit.edu,http://www.rle.mit.edu/people/directory/david-...,profile
52424,77916,ombud.mit.edu,http://ombud.mit.edu/links,others
36844,55252,beecare.bayer.com,https://beecare.bayer.com/what-to-know/pesticides,others
23469,35517,asianderm.org,http://asianderm.org/events.html,conferences
48552,71865,academiccommons.columbia.edu,https://academiccommons.columbia.edu/catalog/a...,thesis
49417,72942,actaneurocomms.biomedcentral.com,https://actaneurocomms.biomedcentral.com/artic...,publication
4698,7064,bmcmedgenet.biomedcentral.com,http://bmcmedgenet.biomedcentral.com/articles/...,publication
36295,54384,www.medicaljournals.se,https://www.medicaljournals.se/acta/instructio...,others


In [3]:
print(test_df.shape)
test_df.sample(10)

(25787, 3)


Unnamed: 0,Webpage_id,Domain,Url
23323,72813,www.ensaiosclinicos.gov.br,http://www.ensaiosclinicos.gov.br/rg/RBR-3h3x4...
16886,50818,investors.gilead.com,http://investors.gilead.com/phoenix.zhtml?ID=1...
3788,11417,www.europsy-journal.com,http://www.europsy-journal.com/article/S0924-9...
7419,22367,www.parasitol.or.kr,http://parasitol.kr/journal/view.php?id=10.334...
13898,41741,validator.w3.org,http://validator.w3.org/check?uri=http%3A%2F%2...
2866,9093,climatechange.conferenceseries.com,http://climatechange.conferenceseries.com/asia...
11119,32959,metalslab.mit.edu,http://metalslab.mit.edu
23575,73339,www.internationalsurgery.org,http://www.internationalsurgery.org/doi/10.973...
16066,48398,patient.info,https://patient.info/forums/discuss/can-having...
25699,79127,www.abingtonhealth.org,https://www.abingtonhealth.org/find-a-doctor/p...


It looks like "Tag" is my output variable. Good to know.

In [4]:
train_df.sample(10)

Unnamed: 0,Webpage_id,Domain,Url,Tag
36682,54997,www.annualreport2014.bayer.com,http://www.annualreport2014.bayer.com/en/cash-...,others
19556,29430,avid.force.com,http://avid.force.com/pkb/articles/faq/Pro-Too...,others
20127,30306,cco.amegroups.com,http://cco.amegroups.com/announcement/view/338,others
12835,19263,ij-healthgeographics.biomedcentral.com,https://ij-healthgeographics.biomedcentral.com...,publication
18630,28087,agelab.mit.edu,http://agelab.mit.edu/people,others
29204,43649,www.babymed.com,https://www.babymed.com/fertility-news/acupunc...,forum
38883,58267,www.boehringer-ingelheim.com,https://www.boehringer-ingelheim.com/locations...,others
22267,33629,nzshs.org,http://nzshs.org/membershipmain/membership,others
6802,9956,www.ctreia.com,https://www.ctreia.com/events/view?eI=254,news
936,1617,www.arthrex.com,https://www.arthrex.com/newsroom/product-updat...,news


I will need to pull the HTML out of the html_data file so I can start working on the features stuck inside it.

The html_data file is quite large, so I'll have to write a function that grabs only the rows I want and merges them with the train table.
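
The idea of streaming a large file and keeping only the wanted rows can be sketched with the standard library alone. This is only an illustration: `filter_rows_by_id` is a hypothetical helper, and the in-memory CSV is a toy stand-in for html_data.csv.

```python
import csv
import io

def filter_rows_by_id(csv_file, wanted_ids, id_column="Webpage_id"):
    """Stream a CSV and yield only the rows whose ID is in wanted_ids."""
    wanted = set(wanted_ids)  # set membership is a constant-time check per row
    for row in csv.DictReader(csv_file):
        if int(row[id_column]) in wanted:
            yield row

# Toy stand-in for html_data.csv (the real file is far larger)
fake_html_csv = io.StringIO(
    "Webpage_id,Html\n"
    "1,<html>a</html>\n"
    "2,<html>b</html>\n"
    "3,<html>c</html>\n"
)
kept = list(filter_rows_by_id(fake_html_csv, wanted_ids=[1, 3]))
print([row["Webpage_id"] for row in kept])
```

Because rows are read one at a time, memory use depends on how many rows are kept rather than on the size of the file.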

In [5]:
html_data.get_chunk()

Unnamed: 0,Webpage_id,Html
0,1,"<!DOCTYPE html>\n<html lang=""en"" dir=""ltr"" xml..."
1,2,"<!DOCTYPE html>\n<html lang=""en"" dir=""ltr"" xml..."
2,3,"<!DOCTYPE html>\n<html lang=""en"" dir=""ltr"" xml..."
3,4,"<!DOCTYPE html>\n<html lang=""en"" dir=""ltr"" xml..."
4,5,"<!DOCTYPE html>\n<html lang=""en"" dir=""ltr"" xml..."
5,6,"<!DOCTYPE html>\n<html lang=""en"" dir=""ltr"" xml..."
6,7,"<!DOCTYPE html>\n<html lang=""en"" dir=""ltr"" xml..."
7,8,"<!DOCTYPE html>\n<html lang=""en"" dir=""ltr"" xml..."
8,9,"<!DOCTYPE html>\n<html lang=""en"" dir=""ltr"" xml..."
9,10,"<!DOCTYPE html>\n<html lang=""en"" dir=""ltr"" xml..."


In [6]:
def challenge_pipeline(in_df,train=True):
    #Last part of the domain, e.g. 'com' (.loc[:, col] assigns a whole column)
    in_df.loc[:,'domain_type'] = in_df['Domain'].str.rsplit('.',n=1).str[-1]
    in_df = get_html(in_df)
    return in_df

def get_html(in_df):
    #Read the large html file in chunks and keep only the rows matching in_df
    reader_obj = pd.read_csv(data_dir+'html_data.csv',iterator=True, chunksize=10000)
    frames = []
    match_indices = set(in_df['Webpage_id'])
    print(len(match_indices),' indices left...')
    for chunk in reader_obj:
        merge_df = pd.merge(in_df,chunk,how='inner',on='Webpage_id')
        match_indices -= set(merge_df['Webpage_id'])
        print(len(match_indices),' indices left...')
        frames.append(merge_df)
        #Stop early once every requested page has been found
        if not match_indices:
            break
    return pd.concat(frames)

In [7]:
train_df_merged = get_html(train_df)
print(train_df_merged.shape)
train_df_merged.sample(10)

53447  indices left...
46616  indices left...
40091  indices left...
33466  indices left...
26910  indices left...
20084  indices left...
13419  indices left...
6322  indices left...
0  indices left...
(53447, 5)


Unnamed: 0,Webpage_id,Domain,Url,Tag,Html
2435,33842,pfizer.com,http://pfizer.com/careers/en/tips-interviewing...,others,"<!DOCTYPE html>\n<html lang=""en"" dir=""ltr"">\n..."
4500,46260,www.invitra.com,https://www.invitra.com/forums/topic/slow-grow...,forum,"<!DOCTYPE html><html lang=""en-US"" prefix=""og: ..."
2056,3237,www.isrctn.com,http://www.isrctn.com/ISRCTN17263619?q=&filter...,clinicalTrials,"\n\n\n\n\n<!DOCTYPE html>\n\n<html lang=""en"" c..."
110,50189,investor.jazzpharma.com,http://investor.jazzpharma.com/phoenix.zhtml?I...,news,\r\n<!DOCTYPE html >\r\n\r\n<!--[if lt IE 7 ]>...
1991,52969,analysis.pharmaceuticalconferences.com,https://analysis.pharmaceuticalconferences.com...,others,"<!DOCTYPE html>\n<html lang=""en"">\n<head>\n<me..."
4713,7083,jnanobiotechnology.biomedcentral.com,http://jnanobiotechnology.biomedcentral.com/ar...,publication,<!DOCTYPE html>\n<!--[if lte IE 8 ]><html lang...
6092,8880,gastrocancer.conferenceseries.com,http://gastrocancer.conferenceseries.com,conferences,<!-- - --><!--/--><!--730--><!DOCTYPE html>\r\...
4352,56305,www.gsk.com,http://www.gsk.com/en-gb/contact-us/worldwide/...,others,\r\n\r\n<!doctype html>\r\n<!--[if IE 9]> <htm...
2043,42817,www.comtecint.com,http://www.comtecint.com/division/comtecmed/,others,"<!DOCTYPE html>\n<html lang=""en-US"" prefix=""og..."
3852,55678,ehp.niehs.nih.gov,https://ehp.niehs.nih.gov/1103645/,others,<!doctype html>\r\n<!--[if lt IE 7]> <html cla...


Well, that works nicely. Let's start feature engineering!

There's an easy first feature in Domain: the domain type (.org, .com, etc.). Let's pull that out. Some of the suffixes are multi-part (go.jp, for example), so splitting on the last dot isn't enough. It turns out there is a really nice little package that can parse domain names for me. It's called tldextract!
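
To see why the multi-part suffixes matter, here is a small sketch comparing a naive split on the last dot with a lookup against a list of known suffixes. The `KNOWN_SUFFIXES` set is a tiny hand-picked assumption for illustration only (tldextract works from the full Public Suffix List), and both helper names are hypothetical.

```python
# Hypothetical, hand-picked suffixes for illustration only
KNOWN_SUFFIXES = {"com", "org", "edu", "gov", "jp", "go.jp"}

def naive_suffix(domain):
    """Take whatever follows the last dot."""
    return domain.rsplit(".", 1)[-1]

def longest_known_suffix(domain, suffixes=KNOWN_SUFFIXES):
    """Return the longest trailing run of labels that appears in `suffixes`."""
    labels = domain.split(".")
    # Try the longest candidate first, then drop labels from the left
    for start in range(len(labels)):
        candidate = ".".join(labels[start:])
        if candidate in suffixes:
            return candidate
    return ""

for d in ["www.bayer.com", "rctportal.niph.go.jp"]:
    print(d, naive_suffix(d), longest_known_suffix(d))
```

For `rctportal.niph.go.jp` the naive split yields `jp`, while the suffix lookup yields `go.jp`.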

In [8]:
import tldextract
train_df_merged['domain_sub'] = train_df_merged['Domain'].apply(lambda x: tldextract.extract(x).subdomain)
train_df_merged['domain_main'] = train_df_merged['Domain'].apply(lambda x: tldextract.extract(x).domain)
train_df_merged['domain_suffix'] = train_df_merged['Domain'].apply(lambda x: tldextract.extract(x).suffix)
train_df_merged.sample(10)

Unnamed: 0,Webpage_id,Domain,Url,Tag,Html,domain_sub,domain_main,domain_suffix
7071,69954,www.nsf.gov,https://www.nsf.gov/news/news_summ.jsp?cntn_id...,news,\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\...,www,nsf,gov
2763,54096,transparency.bayer.com,http://transparency.bayer.com/ES?language=en,others,"<!DOCTYPE html>\n<html lang=""en"">\n<head>\n ...",transparency,bayer,com
625,40828,rnao.ca,http://rnao.ca/connect/workplace-liaisons,others,"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 S...",,rnao,ca
5024,27535,www.euro.who.int,http://www.euro.who.int/en/health-topics/disea...,news,\n<!DOCTYPE html>\n\n\t\t\t\t\t\t\t\t\t\t\t\t\...,www.euro,who,int
6632,69376,rctportal.niph.go.jp,http://rctportal.niph.go.jp/en/detail?trial_id...,clinicalTrials,"<!DOCTYPE html>\n<html lang=""ja"">\n<head>\n<me...",rctportal,niph,go.jp
5270,7682,ecommons.cornell.edu,https://ecommons.cornell.edu/handle/1813/11056,thesis,<!DOCTYPE html>\n <!--[if l...,ecommons,cornell,edu
6299,9206,spine.conferenceseries.com,http://spine.conferenceseries.com,conferences,<!-- - --><!--/--><!--1412--><!DOCTYPE html>\r...,spine,conferenceseries,com
5943,68566,www.globenewswire.com,http://www.amagpharma.com/,news,"<!DOCTYPE html>\n<html lang=""en-US"">\n<head>\n...",www,globenewswire,com
2339,53457,msystems.asm.org,http://msystems.asm.org/content/by/year,others,"<!DOCTYPE html>\n<html lang=""en"" dir=""ltr"" \n ...",msystems,asm,org
837,41123,space.mit.edu,http://space.mit.edu/ligo-detects-merging-blac...,others,<!DOCTYPE html>\n<!--[if IEMobile 7]><html cla...,space,mit,edu


Now how about that Html!?! HTML is usually a mess, so I'm going to lean on Beautiful Soup to help me out.

In [118]:
from bs4 import BeautifulSoup
import nltk
from nltk import wordpunct_tokenize
from nltk.stem.snowball import EnglishStemmer
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
#train_df_merged['Html_soup'] = train_df_merged['Html'].apply(lambda x: BeautifulSoup(x,'html.parser'))
test_soup = BeautifulSoup(train_df_merged.iloc[0]['Html'],'html.parser')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jdber\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jdber\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [104]:
print(test_soup.prettify())

<!DOCTYPE html>
<html dir="ltr" lang="en" prefix="content: http://purl.org/rss/1.0/modules/content/  dc: http://purl.org/dc/terms/  foaf: http://xmlns.com/foaf/0.1/  og: http://ogp.me/ns#  rdfs: http://www.w3.org/2000/01/rdf-schema#  schema: http://schema.org/  sioc: http://rdfs.org/sioc/ns#  sioct: http://rdfs.org/sioc/types#  skos: http://www.w3.org/2004/02/skos/core#  xsd: http://www.w3.org/2001/XMLSchema# " xmlns:article="http://ogp.me/ns/article#" xmlns:book="http://ogp.me/ns/book#" xmlns:product="http://ogp.me/ns/product#" xmlns:profile="http://ogp.me/ns/profile#" xmlns:video="http://ogp.me/ns/video#">
 <head>
  <meta charset="utf-8"/>
  <script type="text/javascript">
   window.NREUM||(NREUM={}),__nr_require=function(e,n,t){function r(t){if(!n[t]){var o=n[t]={exports:{}};e[t][0].call(o.exports,function(n){var o=e[t][1][n];return r(o||n)},o,o.exports)}return n[t].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<t.length;o++)r(t[o]);return r}({1:[functi




Looking at this goop, I think some easy starter text to extract is the title of the page and the body text (marked by `<p>` tags).
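
To make the idea concrete on a toy page, here is a dependency-free sketch that uses the standard library's `html.parser` instead of Beautiful Soup. The class name `TitleAndParagraphs` and the toy HTML are made up for illustration, and the sketch assumes well-formed markup with closed `<p>` tags.

```python
from html.parser import HTMLParser

class TitleAndParagraphs(HTMLParser):
    """Collect the <title> text and the text inside each <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_p = True
            self.paragraphs.append("")  # start a new paragraph buffer

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            self.paragraphs[-1] += data  # text of nested inline tags lands here too

toy_html = (
    "<html><head><title>Trial results</title></head>"
    "<body><p>First paragraph.</p><script>var x = 1;</script>"
    "<p>Second <b>bold</b> paragraph.</p></body></html>"
)
parser = TitleAndParagraphs()
parser.feed(toy_html)
parser.close()
print(parser.title)
print(parser.paragraphs)
```

The script contents are skipped because they sit outside any `<title>` or `<p>` element, which is the same reason the title and paragraph text are attractive starting points.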

In [156]:
#Clean up the text returned by BeautifulSoup's get_text with an nltk pipeline
def nltkPipe(soup_text):
    #Convert to lowercase tokens
    tokens = [x.lower() for x in wordpunct_tokenize(soup_text)]
    #Keep alphabetic words only. No single letters, and no stop words
    words = [w for w in tokens if w.isalpha() and len(w) > 1 and w not in stop_words]
    #Stem each word (strip suffixes) to cut down on vocab
    stemmer = EnglishStemmer()
    word_stems = [stemmer.stem(w) for w in words]
    return word_stems

def getHTMLTitleTokens(html):
    soup = BeautifulSoup(html,'html.parser')
    soup_title = soup.title
    #Some pages have no <title> tag; return an empty token list for those
    if soup_title is not None:
        return nltkPipe(soup_title.get_text())
    else:
        return []

getHTMLTitleTokens(train_df_merged.iloc[0]['Html'])

train_df_merged.sample(100)['Html'].progress_apply(getHTMLTitleTokens)





1581    [anorexia, cachexia, syndrom, lung, cancer, ra...
6504                                     [patient, abbvi]
3112                                       [record, view]
2216    [nvbdcp, nation, vector, born, diseas, control...
1574    [resist, starch, larg, bowel, ferment, broader...
3273    [tell, tale, heart, molecular, cellular, respo...
3245                          [neurotalk, support, group]
5144                                     [joseph, gordon]
2244                             [epa, nice, group, book]
2459                       [bmc, famili, practic, articl]
3445    [fertil, lab, insid, lesson, learn, fifteen, y...
2389                      [cju, intern, articl, abstract]
4643    [explor, career, job, opportun, intuit, intuit...
1313    [divis, endocrinolog, overview, boston, childr...
5572          [journal, compassion, health, care, articl]
2063    [clinic, practic, guidelin, healthi, eat, prev...
3415    [comparison, convent, intracytoplasm, sperm, i...
5394    [look,

In [198]:
def getBodyTokens(html):
    soup = BeautifulSoup(html,'html.parser')
    #Get the text body: all <p> tags with no <span> or <a> inside them
    soup_para = soup.find_all('p')
    soup_para_clean = ' '.join([x.get_text() for x in soup_para if x.span is None and x.a is None])
    text_arr = nltkPipe(soup_para_clean)
    return text_arr

train_df_merged.sample(1)['Html'].progress_apply(getBodyTokens)










1705    [publish, jlb, know]
Name: Html, dtype: object

So now we have a way to tokenize the body text and title text of each HTML page. I want to employ a bag-of-words approach to generate weighted importance measures for each word found in the body or title.
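
As a rough sketch of what such weights could look like, the snippet below computes TF-IDF scores from lists of tokens. TF-IDF is an assumption here, since the weighting scheme has not been chosen yet, and `tfidf_weights` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_weights(token_lists):
    """Return one {token: weight} dict per document using TF-IDF (an assumed weighting)."""
    n_docs = len(token_lists)
    # Document frequency: in how many documents each token appears
    doc_freq = Counter(tok for tokens in token_lists for tok in set(tokens))
    weighted = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens)
        weighted.append({
            # term frequency times log inverse document frequency
            tok: (n / total) * math.log(n_docs / doc_freq[tok])
            for tok, n in counts.items()
        })
    return weighted

# Toy stemmed tokens, in the style produced by nltkPipe
docs = [
    ["clinic", "trial", "patient"],
    ["patient", "forum", "support"],
    ["journal", "articl", "patient"],
]
weights = tfidf_weights(docs)
print(weights[0])
```

A token that appears in every document, such as `patient` here, gets a weight of zero, while tokens specific to one document are weighted higher.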

Let's get each sample's tokens first.

In [205]:
tqdm_notebook().pandas(desc='Getting title tokens...')
train_df_merged['title_tokens'] = train_df_merged['Html'].progress_apply(getHTMLTitleTokens)
tqdm_notebook().pandas(desc='Getting body tokens...')
train_df_merged['body_tokens'] = train_df_merged['Html'].progress_apply(getBodyTokens)

HBox(children=(IntProgress(value=1, bar_style='info', max=1), HTML(value='')))

HBox(children=(IntProgress(value=0, description='Getting title tokens...', max=53447), HTML(value='')))

No title!
No title!
No title!
No title!
No title!
No title!
No title!
No title!
No title!
No title!
No title!
No title!
No title!
No title!
No title!
No title!


KeyboardInterrupt: 

In [None]:
train_df_merged.to_csv("train_with_tokens.csv",index=False)