# Webscraping and Clustering with Juno

Julia is a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments. It provides a sophisticated compiler, distributed parallel execution, numerical accuracy, and an extensive mathematical function library.

In this notebook, we show how we used Julia to:
1. Parse text from a website
2. Perform text clustering, an unsupervised machine learning technique

# 0. Installing Julia

To install Julia on your computer, visit http://julialang.org/downloads/ and follow the instructions.

You can also use Julia online, in a browser. JuliaBox provides online IJulia notebooks, which let you run Julia on a remote machine using Jupyter (formerly called IPython) interactive notebooks.

# 1. Webscraping with Julia

Web scraping (also called web harvesting or web data extraction) is a software technique for extracting information from websites. We will try several web scraping libraries and modules, some of them written for Julia and some imported from other programming languages such as Python.

To use a Python library from Julia, we first need to install the PyCall package by typing Pkg.add("PyCall") in the Julia console.

# 1.1 Requests.jl

Requests.jl is an HTTP client written in Julia. Read more about it at https://github.com/JuliaWeb/Requests.jl.

To use this package, we need to type Pkg.add("Requests") in the Julia console.

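Unfortunately, the library could not be loaded. The error below says that Requests was not found in the package path, which suggests that the package installation did not complete.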

In [19]:
using Requests
url = "http://lavica.fesb.unist.hr/mat1/"
firstrequest = get(url)

LoadError: LoadError: ArgumentError: Requests not found in path
while loading In[19], in expression starting on line 1

# 1.2 HTTPClient.jl

HTTPClient.jl provides HTTP client functionality based on libcurl. It appears to have been deprecated in favour of Requests.jl:

https://github.com/JuliaWeb/HTTPClient.jl/issues/21

To use these packages, we need to type Pkg.add("HTTPClient"), Pkg.add("URIParser") and Pkg.add("Gumbo") in the Julia console.

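Running the code below produces several deprecation warnings (shown after the cell): in this Julia version, Base.String is deprecated in favour of AbstractString.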



In [8]:
using HTTPClient.HTTPC
using URIParser
using Gumbo
 
# Callback for HTTPC.get; allows setting libcurl options
function customize_curl(curl)
  cc = LibCURL.curl_easy_setopt(curl, LibCURL.CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1; rv:28.0) Gecko/20100101 Firefox/28.0")
  if cc != LibCURL.CURLE_OK
    error("CURLOPT_USERAGENT failed: " * bytestring(LibCURL.curl_easy_strerror(cc)))
  end
end
 
function getPage(url::String;debug=true)
 
    try
 
        r = HTTPC.get(url,RequestOptions(
                        request_timeout=8.0,
                        callback=customize_curl
                    ))
 
        if r.http_code != 200
            code = r.http_code
            if debug
                warn("couldn't read url : $url, HTTP code : $code")
            end
            return (false,"")
        end
 
        page = bytestring(r.body)
 
        return (true, page)
 
    catch err
 
        if debug
            println(err)
        end
        return (false,"")
    end
end
 
function getBody(doc)
 
    body = HTMLElement(:body)
    for elem in preorder(doc.root)
        if typeof(elem) == HTMLElement{:body}
            body = elem
            break
        end
    end
 
    return body
 
end
 
function parsePage(page)
 
    doc = parsehtml(page)
    body = getBody(doc)
 
    postUrls = String[]
    titles = String[]
    nextUrl = ""
 
    for elem in preorder(body)
        if typeof(elem) == HTMLElement{:a}
 
            try
                as = attrs(elem)
                if haskey(as,"class") && as["class"] == "title may-blank "
                    push!(titles, lowercase(elem[1].text) )
                    push!(postUrls,  as["href"])
                end
 
                if haskey(as,"rel") && as["rel"] == "nofollow next"
 
                    nextUrl = as["href"]
                end
            end
 
        end
    end
 
    return postUrls, titles, nextUrl
 
end



parsePage (generic function with 1 method)

Base.String is deprecated, use AbstractString instead.
  likely near In[8]:13


# 1.3 Urllib2

Next, we try using a Python module from Julia. To do so, we first need to add the PyCall package in the Julia console: Pkg.add("PyCall")

The urllib2 module defines functions and classes which help in opening URLs (mostly HTTP).

Using urllib2, we can easily access the content of a URL.

In [4]:
using PyCall
@pyimport urllib2 as urllib2
req = pycall(urllib2.Request, PyAny, "http://lavica.fesb.unist.hr/mat1/")
html_doc=urllib2.urlopen(req)
html_doc[:read](500)

INFO: Recompiling stale cache file C:\Users\Ja\.julia\lib\v0.4\PyCall.ji for module PyCall.


"<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\">\n<html>\n<head>\n<title> Matematika 1 </title> \n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n<link rel=\"shortcut icon\" href=\"favicon.ico\"/>\n</HEAD>\n<frameset rows=\"46,50,*\" border=0>\n<frame src=\"banner.html\" name=\"Banner\" scrolling=\"no\">\n<frame src=\"f_mat1.html\" name=\"Upravljanje\" scrolling=\"no\">\n<frame src=\"osnovna.html\" name=\"Tekst\">\n</frameset>\n</html>\n"

# 1.4 BeautifulSoup

Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping. Three features make it powerful:

- Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. It doesn't take much code to write an application.
- Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't detect one. Then you just have to specify the original encoding.
- Beautiful Soup sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility. 

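The module does not seem to work. Note, however, that the error below is a Julia syntax error rather than a Beautiful Soup failure: Julia reserves single quotes for character literals, so 'html.parser' would have to be written as "html.parser".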

In [8]:
@pyimport bs4 
@pyimport BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

LoadError: LoadError: syntax: invalid character literal
while loading In[8], in expression starting on line 4

# 1.5 LibCURL

In [9]:
# Code from GitHub
using LibCURL
# init a curl handle
curl = curl_easy_init()

# set the URL and request to follow redirects
curl_easy_setopt(curl, CURLOPT_URL, "http://example.com")
curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1)
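# set up the callback function to receive data
# (note: libcurl delivers response data through CURLOPT_WRITEFUNCTION;
# CURLOPT_READFUNCTION, set below, supplies upload data, which would
# explain why no "recd:" line appears in the output)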
function curl_read_cb(curlbuf::Ptr{Void}, s::Csize_t, n::Csize_t, p_ctxt::Ptr{Void})
    sz = s * n
    data = Array(UInt8, sz)

    ccall(:memcpy, Ptr{Void}, (Ptr{Void}, Ptr{Void}, UInt64), curlbuf, data, sz)
    println("recd: ", bytestring(data))

    sz::Csize_t
end

c_curl_read_cb = cfunction(curl_read_cb, Csize_t, (Ptr{Void}, Csize_t, Csize_t, Ptr{Void}))
curl_easy_setopt(curl, CURLOPT_READFUNCTION, c_curl_read_cb)


# execute the query
res = curl_easy_perform(curl)
println("curl url exec response : ", res)

# retrieve HTTP code
http_code = Array(Clong,1)
curl_easy_getinfo(curl, CURLINFO_RESPONSE_CODE, http_code)
println("httpcode : ", http_code)

# release handle
curl_easy_cleanup(curl)

curl url exec response : 0
httpcode : Int32[200]


In [10]:
# Code from the book Mastering Julia
const CURLOPT_URL = 10002
const CURLOPT_FOLLOWLOCATION = 52;
const CURLE_OK = 0
jlo = "http://julialang.org";
curl = ccall( (:curl_easy_init, "libcurl"), Ptr{Uint8}, ())
ccall((:curl_easy_setopt, "libcurl"), Ptr{Uint8},
(Ptr{Uint8}, Int, Ptr{Uint8}), curl, CURLOPT_URL, jlo.data)
ccall((:curl_easy_perform,"libcurl"),
Ptr{Uint8}, (Ptr{Uint8},), curl)
ccall((:curl_easy_cleanup,"libcurl"),
Ptr{Uint8},(Ptr{Uint8},), curl);



LoadError: LoadError: error compiling anonymous: could not load library "libcurl"
The specified module could not be found.

while loading In[10], in expression starting on line 5

  likely near In[10]:5


# 1.6 Parsing

Since we did not find a module that parses HTML documents reliably, we can fall back on the replace function and regular expressions to remove the HTML tags from the document we acquired in 1.3 Urllib2 (stored in the PyObject f).

In [16]:
# Using the replace function we could (in a more primitive way) get rid of the HTML tags
s1 = "The quick brown brown fox jumps over the lazy dog α,β,γ"
r = replace(string(f), "menu", "menu1")
show(r); println()

"PyObject <addinfourl at 418455048L whose fp = <socket._fileobject object at 0x0000000018E9C9A8>>"


Since we acquired the HTML document using a Python module, it is currently stored as a PyObject, so string(f) returns only the object's description rather than the HTML itself. To convert it, we need to follow these instructions:

https://github.com/stevengj/PyCall.jl

or:

http://stackoverflow.com/questions/5356773/python-get-string-representation-of-pyobject

In [17]:
f1 = convert(T, o::f)

LoadError: LoadError: UndefVarError: T not defined
while loading In[17], in expression starting on line 2

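We were not able to convert the PyObject into a string. In the PyCall documentation, T in convert(T, o::PyObject) is a placeholder for the target Julia type, so it has to be replaced by a concrete type. Note also that f is a response object rather than a Python string; calling its read method, as in html_doc[:read](500) in 1.3 Urllib2, returns the content as a string.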
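
The tag-removal idea described at the start of this section can be sketched on an ordinary Julia string. This is only a minimal illustration under a few assumptions: the input is a shortened copy of the HTML printed in 1.3 Urllib2, the helper name strip_tags is hypothetical (it is not part of any library), and the code uses the pair syntax of replace from Julia 1.x (on Julia 0.4 the same call was written with three arguments, replace(s, pattern, replacement)). A regular expression like this one does not handle scripts, comments, or attribute values containing ">", so it is no substitute for a real HTML parser.

```julia
# Sample input: a shortened copy of the HTML returned in 1.3 Urllib2.
html = "<html>\n<head>\n<title> Matematika 1 </title>\n</head>\n</html>\n"

# Hypothetical helper (not a library function): replace everything between
# '<' and '>' with a space, then collapse the leftover whitespace.
function strip_tags(s::AbstractString)
    no_tags = replace(s, r"<[^>]*>" => " ")
    return strip(replace(no_tags, r"\s+" => " "))
end

text = strip_tags(html)
println(text)   # prints: Matematika 1
```

The same helper could presumably be applied to the string returned by the read method of the response object, once the page content is available as a Julia string.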

# 2. Clustering

Cluster analysis or clustering, a type of unsupervised machine learning, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).

Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to efficiently find them.
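
For text, one common notion of similarity compares how often each word occurs in two documents. The sketch below is an illustration only, not the method used in this notebook: the helper names word_counts and cosine_similarity and the three sample sentences are assumptions made up for the example, and the code targets Julia 1.x using only built-in functions.

```julia
# Hypothetical helper: count lowercase word occurrences in a document.
function word_counts(doc::AbstractString)
    counts = Dict{String,Int}()
    for m in eachmatch(r"\w+", lowercase(doc))
        w = String(m.match)
        counts[w] = get(counts, w, 0) + 1
    end
    return counts
end

# Hypothetical helper: cosine similarity between two word-count dictionaries.
function cosine_similarity(a::Dict{String,Int}, b::Dict{String,Int})
    dotprod = sum(Float64[v * get(b, w, 0) for (w, v) in a])
    norm_a = sqrt(sum(Float64[v^2 for v in values(a)]))
    norm_b = sqrt(sum(Float64[v^2 for v in values(b)]))
    return (norm_a == 0 || norm_b == 0) ? 0.0 : dotprod / (norm_a * norm_b)
end

# Made-up sample documents (assumption, not data from the scraped site).
docs = ["limits and derivatives of functions",
        "derivatives and integrals of functions",
        "vectors and matrices"]
counts = [word_counts(d) for d in docs]

sim12 = cosine_similarity(counts[1], counts[2])  # documents sharing four words
sim13 = cosine_similarity(counts[1], counts[3])  # documents sharing only "and"
println(sim12 > sim13)   # prints: true
```

Documents that share more vocabulary get a similarity closer to 1, which is the kind of signal a clustering algorithm can then use to group them.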

In order to perform clustering in Julia, we could use Clustering.jl:

https://github.com/JuliaStats/Clustering.jl
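
To make the idea of a clustering algorithm concrete, here is a minimal sketch of k-means (Lloyd's algorithm) written in plain Julia 1.x. It does not use or reproduce the Clustering.jl API: the function names assign_clusters and simple_kmeans, the naive initialisation with the first k points, and the toy data are all assumptions chosen for illustration. It also assumes that the documents have already been turned into numeric feature vectors, stored as the columns of a matrix.

```julia
using Statistics  # for mean

# Hypothetical helper: label each column of X with the index of the
# nearest centroid (a column of C), using squared Euclidean distance.
function assign_clusters(X::Matrix{Float64}, C::Matrix{Float64})
    n, k = size(X, 2), size(C, 2)
    labels = Vector{Int}(undef, n)
    for j in 1:n
        best, bestdist = 1, Inf
        for c in 1:k
            d = sum(abs2, X[:, j] .- C[:, c])
            if d < bestdist
                best, bestdist = c, d
            end
        end
        labels[j] = best
    end
    return labels
end

# Hypothetical minimal k-means: alternate between assigning points and
# recomputing centroids until the labels stop changing.
function simple_kmeans(X::Matrix{Float64}, k::Int; maxiter::Int = 100)
    C = X[:, 1:k]                      # naive initialisation (assumption)
    labels = assign_clusters(X, C)
    for _ in 1:maxiter
        for c in 1:k
            members = findall(l -> l == c, labels)
            if !isempty(members)
                C[:, c] = vec(mean(X[:, members], dims = 2))
            end
        end
        newlabels = assign_clusters(X, C)
        newlabels == labels && break   # converged
        labels = newlabels
    end
    return labels, C
end

# Toy data: four 2-dimensional points forming two obvious groups.
X = [0.0 0.1 5.0 5.1;
     0.0 0.2 5.0 4.9]
labels, centers = simple_kmeans(X, 2)
println(labels)   # prints: [1, 1, 2, 2]
```

A maintained package such as Clustering.jl would normally be preferable to hand-written code like this, since it offers better initialisation strategies and further algorithms.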